Creating Task-Specific Foundation Models with GPT-4

Introduction

→ Check all our task-specific foundation models and datasets here.

Let’s say that you want to know the sentiment of your users by analyzing their messages on your platform. You might want to know if they are going to churn or something like that. You could go the LLM route, but it is expensive. The alternative is to use a good old BERT-size model: they are small, fast, and powerful enough to solve most text classification tasks.

To proceed, you would start from a pre-trained foundation model that is trained in an unsupervised way on a diverse dataset, and then fine-tune it on your task and data - a classic transfer learning procedure. However, fine-tuning requires to label many examples in order to reach the desired level of specialization. Alternatively, you could take a model that is already trained on a sentiment analysis task, such as something trained on tweets sentiments, and fine-tune this model instead. Unfortunately, your data is qualitatively different from tweets, and what you mean by “sentiment analysis” also differs a bit. So, in truth, these specialized models are not so helpful, unless you are lucky enough for your task and data to be very similar to the pre-trained model’s task and data.

In this case, a much better solution would be to start from a model that is pre-trained on a broad definition of “sentiment analysis”, and on a variety of domains (tweets, blogs, news, reviews, emails, etc.). We call these models task-specific foundation models to highlight that they are broadly specialized to a kind of task, but domain-agnostic, and intended to be used as foundation models. Note that there already are public domain-specific foundation models and language-specific foundation models, but task-specific foundation models are lacking. Fortunately, we came up with a simple, cheap, and reliable way to create these models using LLMs, and thus efficiently filling up this specialization gap.

In the following we show how to create these task-specific foundation models from diverse data that was automatically labelled by an LLM, using sentiment analysis as an example. The resulting foundation model achieves much better transfer-learning performance than other pre-trained models, achieving in some cases >10x data efficiency, such as in the case of this financial news dataset:

Transfer learning performance on the financial_phrasebank dataset for our sentiment analysis foundation model compared to a state-of-the-art generic foundation model (e5-base-v2).

Task-Specific Foundation Model

Ok, we want to train a foundation model that is good at a particular task - in this case, sentiment analysis - and easily adaptable to all kind of domains. The simplest approach is to take existing labelled datasets for this task and train a model on a mix of these datasets. The resulting model can then be fine-tuned on a new dataset. The issue with this method is that we are limited by the diversity of public datasets. Indeed, almost all public datasets for sentiment analysis either consist of tweets, product reviews, movie reviews, or financial news. While these domains are definitely common for sentiment analysis, there is a long tail of other domains that need of sentiment analysis, and we have seen a few of them with our customers (group chat messages, internal user messages, etc.). Also, public datasets are often created for academic purpose and with preprocessing such as language filtering that harms the diversity of the data.

To create a truly domain-agnostic, task-specific foundation model, we would like to train on all kind of data, and even with slightly different meanings of “sentiment analysis”. Unfortunately, there aren’t diverse-enough labeled datasets for this purpose, but we can create one! Recent LLMs like GPT-4 or even GTP-3.5 (especially the earliest version) are very good at understanding text and can be used to label data automatically. This means that we only need to find a large and diverse dataset, write a good prompt for an LLM to automatically label the data, and then train a model on this data to obtain our task-specific foundation model.

To do so, we randomly select 300,000 text snippets from the C4 dataset, a large general-domain dataset from the web, which we annotate using GPT-3.5 with the following simple prompt:

"The goal is to create a dataset for sentiment analysis. Classify the input text as Positive, Negative, or Neutral. Return only the label. Do not return the input text or anything else.”

Note that we don’t explicitly mention what is “Positive”, “Negative”, and “Neutral" on purpose, to allow for different interpretations by the LLM. This ensures that the sentiment analysis foundation model is not overspecialized to a specific kind of sentiment analysis. We tried more complex prompts, but this simple one obtained great results. Here is a sample from the annotated dataset:

$\begin{array}{|l|l|} \hline \textbf{Text} & \textbf{Label} \\ \hline \begin{array}{l} \small\text{Another 400 charities are in danger of losing their status with the national charity} \\ \small\text{regulator for failing to make contact with them.} \end{array} & \text{Negative} \\ \hline \begin{array}{l} \small\text{Happiness is sweet persimmons wrapped with brie and ham in a buttery puffed pastry} \end{array} & \text{Positive} \\ \hline \begin{array}{l} \small\text{From your physical assets down to business data that are critical to the operation} \\ \small\text{of your business – all of it can be found within your commercial area or space} \end{array} & \text{Neutral} \\ \hline \end{array}$

Unsurprisingly, this dataset is class-imbalanced. Most documents are neutral or positive, with only 5% of them being negative:

Learning from such imbalanced data requires unnecessary computation, so we randomly remove documents from the “Neutral” and “Positive” classes to obtain a balanced dataset. In the end, we obtain 40,000 annotated sentences that you can download here.

Now, we just need to train our task-specific foundation model on this dataset. To do so, we start from the e5-base-v2 model, which is a state-of-the-art BERT-size foundation model. Then, we fine-tune its last three layers, as it proved to lead to higher performance, better model stability, and higher data-efficiency. We obtain a pretty good generic sentiment analysis model that you can download here.

Performance Evaluation

We are not interested in directly using our model to analyze sentiment, but rather using it as a foundation to be adapted to specific tasks and domains. Therefore, in order to test the transfer learning abilities of our foundation model, we use four sentiment analysis datasets from typical domains: airlinetweetSA, financial_phrasebank, amazon_en, and climate_en, and compare the performance against three alternative foundation models: e5-base-v2 (E5), e5-base-v2 fine-tuned on the SST2 general-domain sentiment analysis dataset (E5-SST2), and e5-base-v2 fine-tuned on all evaluation datasets except the one currently being evaluated (E5-Multi). All of these models were obtained by fine-tuning last three layers. Note that, while E5 is a generic foundation model, E5-SST2 and E5-Multi are already task-specific foundation models.

To evaluate the performance of each foundation model, we fine-tune it on some examples from a dataset and measure the resulting model’s performance on the corresponding test set. To simplify the fine-tuning process - and because we are only interested in comparing the performance of foundation models - we simply train a logistic regression on top of the last layer of the foundation model instead of a full fine-tuning procedure. Here is the performance we obtain for the F1-score:

$\begin{array}{|c|c|c|c|c|c|} \hline \textbf{Dataset} & \textbf{# Examples} & \textbf{E5} & \textbf{E5-SST2} & \textbf{E5-Multi} & \textbf{Ours} \\ \hline & 1 & 0.360 & 0.488 & 0.454 & \mathbf{0.540*} \\ \text{airlinetweetSA} & 5 & 0.491 & 0.574 & 0.628 & \mathbf{0.651*} \\ & 10 & 0.546 & 0.616 & 0.648 & \mathbf{0.680*} \\ & 50 & 0.675 & 0.685 & 0.703 & \mathbf{0.715*} \\ \hline & 1 & 0.234 & 0.314 & \mathbf{0.355*} & 0.327 \\ \text{amazon_en} & 5 & 0.352 & 0.407 & \mathbf{0.441*} & 0.426 \\ & 10 & 0.382 & 0.430 & 0.448 & \mathbf{0.455*} \\ & 50 & 0.437 & 0.461 & 0.484 & \mathbf{0.486*} \\ \hline & 1 & 0.444 & 0.443 & 0.453 & \mathbf{0.516*} \\ \text{climate_en} & 5 & 0.622 & 0.634 & 0.643 & \mathbf{0.667*} \\ & 10 & 0.685 & 0.681 & 0.684 & \mathbf{0.692*} \\ & 50 & 0.741 & 0.741 & 0.750 & \mathbf{0.751*} \\ \hline & 1 & 0.321 & 0.427 & 0.478 & \mathbf{0.585*} \\ \text{financial_phrasebank} & 5 & 0.485 & 0.576 & 0.643 & \mathbf{0.692*} \\ & 10 & 0.540 & 0.624 & 0.671 & \mathbf{0.732*} \\ & 50 & 0.664 & 0.695 & 0.720 & \mathbf{0.749*} \\ \hline \end{array}$

And here is a visualization these results:

Transfer learning performance of foundation models on various sentiment analysis datasets.

There are two important takeaways from these results. First, all three task-specialized foundation models (E5-SST2, E5-Multi, and our LLM-annotated-dataset model) beat the generic E5 model. This shows the utility of using task-specialized foundation models, and is not that surprising. Second, something more surprising, our LLM-annotated-dataset foundation model is clearly better than the other task-specialized foundation models. This shows the superiority of our approach to creating such a model compared to using existing dataset. The LLM used might not annotate as well as a human, but this is largely compensated for by the diversity of data found in C4 and the diversity of the “sentiment analysis” definition implied by the prompt. In the end, there is more generic knowledge about the sentiment analysis task inside our LLM-annotated dataset than inside available human-annotated datasets, even if they claim to be generic, and even if they are combined!

Let’s Fill The Gap!

We found a simple method for creating high-quality task-specific models using modern LLMs. We applied this method to create a state-of-the-art, domain-agnostic foundation model for sentiment analysis. We have released this model with an open source license, and included it in NuMind so that all of our users can efficiently create custom sentiment analysis models.

Now, given the success of this method, we think it is time to create additional task-specific foundation models for other tasks (such as toxicity detection or topic identification) and other languages. These models will allow to quickly obtain customized BERT-size models for all kind of domains, hence democratizing NLP further. Let’s fill this specialization gap!

‍

Creating Task-Specific Foundation Models with GPT-4

Introduction

Task-Specific Foundation Model

Performance Evaluation

Let’s Fill The Gap!

Related posts

NuExtract Platform: The New Information Extraction

NuExtract 2.0: Outclassing Frontier LLMs in Information Extraction