A Foundation Model for Entity Recognition

Entity Recognition

Entity recognition, a.k.a. entity extraction or Named Entity Recognition (NER), is the task of detecting mentions of entities and other kinds of concepts in a text and assigning each a type. Entities can be things like cities, people, companies, or any other instance of human concepts. Here is an example of a piece of a legal document in which relevant entities are annotated by their type:

Legal document with annotated entities (from NuMind's annotation interface).

Entity recognition is a core task of information extraction, used in many applications such as automatic news analysis, medical coding, or legal document analysis.

To tackle such a task, you could try to use a modern LLM like GPT-4 with a good prompt, and it might be enough in some cases. However, if you need to process a lot of documents, obtain better performance, or if confidentiality is an issue, you might want to use a traditional deep learning approach instead. This approach consists of taking a foundation model like BERT and fine-tuning it on hand-annotated data. The issue is that the amount of annotated data needed to reach - and eventually surpass - GPT-4 performance on a specific task is large, typically in the hundreds of annotated documents.

There are two effective ways to reduce the amount of human-annotated data needed: first, you can annotate your data automatically using an LLM like GPT-4 (assuming confidentiality is not an issue). Second, you can use a better BERT-size foundation model. Of course, you can also mix both approaches to obtain even better performance (which is something NuMind lets you do!). This post is about the second approach: we create a foundation model that can be fine-tuned with less data than before.

RoBERTa Out-of-the-Box

Typical foundation models used to create custom entity recognizers are transformers like BERT and RoBERTa. These neural networks are composed of several layers which transform input words (or subwords) step by step into vector representations (a.k.a. embeddings). They are trained in a self-supervised way (no annotation needed) on large multi-domain corpora of text on the task of predicting missing words from their context. Here is an example:

[the, cat, ???, on, the, ???] → [sat, mat]
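To make the training task concrete, here is a minimal sketch of how such a masked example could be built from a sentence. The `???` placeholder and the `mask_tokens` helper are illustrative assumptions; real models use a dedicated mask token and choose the hidden positions at random.

```python
MASK = "???"  # placeholder from the example above; real models use a special mask token


def mask_tokens(tokens, positions):
    """Hide the tokens at `positions`; return (masked input, words to predict)."""
    hidden = set(positions)
    masked = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    targets = [tokens[i] for i in sorted(hidden)]
    return masked, targets


masked, targets = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], [2, 5])
print(masked, "->", targets)
```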

In order to predict a missing word you have to understand its meaning, which means that the network is forced to learn word meanings in their context. We can find some of this understanding in the last layer. For example, here is a visualization of the vectors produced by the last layer of BERT (one vector per word), where each vector is reduced to three dimensions to define a color:

Visualization of BERT embeddings. From: Introduction to Machine Learning.

When Amazon is meant to be a river, it is green, but when it is meant to be a company, it is black/brown. This seems perfect for creating an entity recognizer. For example, we can simply attach a linear classifier on top of these embeddings to predict if the word corresponds to a given concept (i.e. entity type) or not:

Neural network computing the token (a word or a subword) probabilities to be part of a particular concept. Takes a sentence as input and returns one probability for each token.

This is like attaching one binary logistic regression for each concept on top of the embeddings. For a given concept, the network aims to give a high probability to words belonging to this concept, and a low probability to others. We can then fine-tune this modified network on our annotated data or just train the linear classifier separately.
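As a rough illustration of this classifier head, the sketch below computes one probability per token and per concept from pre-computed embeddings. The toy embeddings, weights, and biases are made-up values, and the function name is hypothetical; in practice the embeddings would come from the last layer of the transformer and the weights would be learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def concept_probabilities(token_embeddings, weights, biases):
    """Return probs[t][c]: probability that token t belongs to concept c.

    token_embeddings: one d-dimensional vector per token
    weights: one d-dimensional weight vector per concept
    biases: one scalar per concept
    """
    probs = []
    for emb in token_embeddings:
        row = []
        for w, b in zip(weights, biases):
            logit = sum(wi * ei for wi, ei in zip(w, emb)) + b
            row.append(sigmoid(logit))  # independent binary decision per concept
        probs.append(row)
    return probs


# Toy 2-dimensional embeddings for three tokens and two hypothetical concepts.
embeddings = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights = [[3.0, -3.0], [-3.0, 3.0]]
biases = [-1.0, -1.0]
probs = concept_probabilities(embeddings, weights, biases)
```

Because each concept has its own sigmoid rather than sharing a softmax, a token can in principle receive a high probability for several concepts at once.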

To evaluate the transfer learning capabilities of this method, we use four human-annotated entity recognition datasets: MIT Movie (12 concepts), MIT Restaurant (8 concepts), OntoNotes 5 (18 concepts), and BioNLP 2004 (5 concepts). We train models on portions of each dataset and measure the resulting model performance on the corresponding test set. We do this several times for each setting to obtain more accurate estimates. To simplify things - and because we are only interested in comparing transfer learning performance - we only train the linear classifier while all other layers remain frozen.

For the model, we use the base version of RoBERTa as it provides better performance than BERT and other similar models for this task. Here is what we get when we average the (macro) F1-scores obtained over these datasets:

Transfer learning performance using the last layer of English RoBERTa base. x-axis: for each concept, we select x training examples which include this concept.
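One possible way to build such a training subset is sketched below. This is an assumption about the sampling, not the exact procedure used for the plot: for each concept we draw up to `k` examples that contain it and take the union. The data layout (a list of `(tokens, labels)` pairs with `None` for unlabeled tokens) and the function name are hypothetical.

```python
import random


def sample_few_shot(examples, k, seed=0):
    """examples: list of (tokens, labels); labels[i] is a concept name or None.

    For each concept, draw up to k examples that contain it, and return the
    union of the drawn examples without duplicates.
    """
    rng = random.Random(seed)
    by_concept = {}
    for idx, (_, labels) in enumerate(examples):
        for concept in set(labels) - {None}:
            by_concept.setdefault(concept, []).append(idx)
    chosen = set()
    for concept in sorted(by_concept):
        indices = by_concept[concept]
        chosen.update(rng.sample(indices, min(k, len(indices))))
    return [examples[i] for i in sorted(chosen)]
```

Note that an example drawn for one concept may also mention other concepts, so a concept can end up with more than `k` examples in the subset. Repeating the draw with different seeds gives the repeated runs needed to average out sampling noise.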

Naturally, performance improves with the number of examples and starts to saturate at some point. This transfer learning procedure works well and constitutes our baseline.
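For reference, here is a minimal sketch of a macro-averaged F1-score: one F1 per concept, then an unweighted mean over concepts. It scores individual tokens, which is an assumption made for simplicity; the post does not say whether scoring is done per token or per entity span.

```python
def f1_for_concept(gold, pred, concept):
    """Token-level F1 for one concept; gold and pred are per-token labels."""
    pairs = list(zip(gold, pred))
    tp = sum(g == concept and p == concept for g, p in pairs)
    fp = sum(g != concept and p == concept for g, p in pairs)
    fn = sum(g == concept and p != concept for g, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, concepts):
    """Unweighted mean of the per-concept F1 scores."""
    return sum(f1_for_concept(gold, pred, c) for c in concepts) / len(concepts)


gold = ["PER", "PER", None, "LOC"]
pred = ["PER", None, None, "LOC"]
score = macro_f1(gold, pred, ["PER", "LOC"])
```

With macro averaging, a rare concept weighs as much as a frequent one, which matters when some concepts have very few test examples.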

Leveraging Human-Annotated Datasets

Using the last layer of RoBERTa seems to work, but we can do better. Indeed, while the embeddings of the last layer do contain contextual semantic information about their corresponding words, there is no reason for them to contain information about human concepts, or for this information to be encoded in such a way that a linear classifier can easily access it. Information about human concepts might be buried in other layers, or not easily accessible because the network invented its own concepts which may not align well with human concepts (similar to how some human languages have concepts that do not have an equivalent in other languages). What we want is a foundation model that explicitly knows about human concepts, and we want this information to be directly accessible in the last layer.

One simple idea to obtain such a model is to train an entity recognizer on a dataset annotated with a large and diverse set of concepts. If the concepts are numerous and diverse enough, the resulting model should be able to easily learn new concepts (i.e., it should have good transfer learning capabilities). We can start from a pre-trained model like RoBERTa and fine-tune it on such a dataset. In a sense, this procedure forces the information about entities to surface in the last layer.

This is essentially what some researchers did about two years ago. They created the NER Corpus, a large dataset (16M examples, 475M tokens) containing 315 unique concepts, based on Wikipedia text. We use this dataset to fine-tune the last 6 layers of RoBERTa as it provides the best results. Here is the transfer learning performance of the resulting model compared to the performance of the base model:

Transfer learning performance of RoBERTa base, and RoBERTa base fine-tuned on NER Corpus.

We can see that the performance is a little bit better than our baseline, but nothing spectacular. One interesting thing is that this new model performs comparatively better for a very small training set (few examples per concept), while it becomes equivalent to the baseline for larger training sets. This is a behavior that should be expected as these foundation models have the same architecture, so they become equivalent if the training data is large enough.

This model improves results over the baseline, indicating we are moving in the right direction, but we can do much better. Indeed, a set of 315 entity types is small compared to the multitude of human concepts. Also, this dataset lacks diversity in terms of domains because its text only comes from Wikipedia. Ideally, we would like to train our foundation model on tens of thousands, or even hundreds of thousands of unique concepts, and on a highly diverse dataset. No such dataset exists, and creating it using human annotators would be too costly. Fortunately, thanks to modern LLMs, we don’t need human annotators.

Using LLMs to Annotate Human Concepts

Modern LLMs like GPT-4 are excellent at text understanding, and they can be used to automate annotation tasks, as we did in the case of sentiment analysis. For entity recognition, things are a bit more complicated. How can we make an LLM effectively annotate human concepts? And which concepts should we choose?

One idea is to first define a list of concepts/entity types - a.k.a. an ontology - and then make the LLM annotate text using this ontology. This does not give the best results. Indeed, for some reason, GPT-3.5 does not annotate well from a pre-defined ontology. GPT-4 does a better job at it but is more expensive. Also, it is difficult to come up with an ontology that is large and diverse, even with the help of LLMs.

In the end, we find an elegant solution that solves both problems at once: instead of defining the ontology first and then annotating the data, we let the LLM figure out the ontology as it annotates, introducing new concepts on the fly. This allows us to obtain a diverse ontology (assuming the data is itself diverse), and, by not constraining the LLM to a pre-defined ontology, it also considerably improves the annotation quality. We find that we can even use GPT-3.5 in this setting and obtain a good quality dataset. Here is the prompt that we use:

The goal is to create a dataset for entity recognition. Label as many entities, concepts, and ideas as possible in the input text. Invent new entity types that may not exist in traditional NER Tasks such as more abstract concepts and ideas. Make sure the entity concept is not part of speech but something more meaningful. Avoid finding meaningless entities. Output format (separate entities with new lines, everything, including description, and entity concept is written in English):
entity from the text -|- entity concept -|- description of entity group/concept
Input: <INPUT_SENTENCE>

And here is an example of input-output using this prompt with GPT-4 (the current version of GPT-3.5 does not work well with this prompt anymore):

Input: World War II started on September 1 1939
Output: World War II -|- Historical Event -|- Significant events or periods in world history that had wide-reaching effects.
September 1 -|- Date -|- Specific points in time according to the Gregorian calendar.
1939 -|- Year -|- A 12-month period on the Gregorian calendar, representing a specific point in time.

As you can see, the LLM extracts sensible concepts from the sentence. In this case it would have been nice to also obtain concepts like “month” or “war”, for example, but that is already something we can work with. Note that we also ask for a description of the concepts in order to avoid any ambiguity.
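Turning such raw output into usable annotations requires a small parser. The sketch below splits each line on the `-|-` separator from the prompt and skips malformed lines. The function name, the dictionary layout, and the optional check that the entity appears verbatim in the input are assumptions added for illustration, not a description of the pipeline actually used.

```python
SEPARATOR = "-|-"


def parse_annotations(llm_output, source_text=None):
    """Parse lines of the form 'entity -|- concept -|- description'.

    If source_text is given, entities that do not appear verbatim in it are
    dropped (an extra sanity check, assumed here for illustration).
    """
    annotations = []
    for line in llm_output.splitlines():
        parts = [part.strip() for part in line.split(SEPARATOR)]
        if len(parts) != 3 or not all(parts):
            continue  # skip malformed lines rather than guess their meaning
        entity, concept, description = parts
        if source_text is not None and entity not in source_text:
            continue
        annotations.append(
            {"entity": entity, "concept": concept, "description": description}
        )
    return annotations


sentence = "World War II started on September 1 1939"
raw = (
    "World War II -|- Historical Event -|- Significant events or periods in world history that had wide-reaching effects.\n"
    "September 1 -|- Date -|- Specific points in time according to the Gregorian calendar.\n"
    "1939 -|- Year -|- A 12-month period on the Gregorian calendar, representing a specific point in time."
)
annotations = parse_annotations(raw, source_text=sentence)
```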

We use this prompt with GPT-3.5, March 2023 version (subsequent versions are worse at this task), to annotate 160k English sentences from the C4 dataset, a large and diverse general-domain dataset. This results in about 800k annotations (an average of 5 concepts extracted per sentence) and the creation of 80k unique concepts. Naturally, some of these concepts are more common than others. Here is the number of occurrences of each concept in the dataset, sorted from most to least common:

Concept counts, sorted from most to least common.

We can see a long-tailed distribution, close to a power law. Here is a cumulative count to see if most annotations are made with common concepts or with rare concepts:

Cumulative concept counts, sorted from most to least common concept.

We can see that the 100 most common concepts each appear more than 500 times in the dataset, and together they account for 43% of the annotations made. On the other end of the distribution, about 50k rare concepts appear only once, and they account for 5.7% of the annotations. Here are the 100 most common concepts weighted by their frequency in the dataset:

The 100 most common concepts present in the dataset; size reflects their frequency of appearance.

We find classic concepts like “person” or “location”, but also more interesting ones like “medical procedure” or “operating system”.
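The summary statistics quoted above (share of annotations covered by the most common concepts, number of concepts seen only once) are straightforward to compute from the list of annotated concept names. Here is a minimal sketch on toy data; the function name and the returned keys are hypothetical.

```python
from collections import Counter


def concept_stats(annotated_concepts, top_k=100):
    """annotated_concepts: one concept name per annotation in the dataset."""
    counts = Counter(annotated_concepts)
    total = sum(counts.values())
    top_total = sum(count for _, count in counts.most_common(top_k))
    singletons = sum(1 for count in counts.values() if count == 1)
    return {
        "unique": len(counts),                  # number of distinct concepts
        "top_share": top_total / total,         # fraction of annotations using the top_k concepts
        "singletons": singletons,               # concepts that appear exactly once
        "singleton_share": singletons / total,  # fraction of annotations they account for
    }


toy = ["person"] * 5 + ["location"] * 3 + ["art studio", "physics formula"]
stats = concept_stats(toy, top_k=2)
```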

While most annotations are made with common concepts, rare concepts still contribute to a decent chunk of the annotations. Here are 100 concepts appearing only once in the dataset:

Sample of concepts appearing only once in the dataset.

We can see concepts like “art studio” or “physics formula” which are sensible, but also more esoteric concepts, like “flower center color” or the mysterious “9780763677305”, which are certainly not useful.

To get a better feel of what these concepts are, here is a scatter plot where concepts are embedded in two dimensions, with the name of the most common concepts displayed:

Feature plot of the 80k unique concepts present in the dataset. We average GloVe word embeddings of concept names and use UMAP to reduce dimensions (two dimensions for the position, and three for the color).

Similar concepts are grouped together here. For instance, food-related concepts are on the bottom right (in brown), and real-estate concepts on the top right (in red). We can see some overlaps. For example, “company”, “company name”, and “business” are present, even though they are almost equivalent. Still, we find 28k unique words in the set of concept names. Overall, this dataset exhibits a much higher concept diversity than any available human-labeled dataset.

In terms of annotation quality, we find mistakes, but overall the extracted concepts make sense; only a small fraction could be considered “wrong”. However, concepts that are present in some sentences are missed in other sentences. Said differently, this dataset does not contain many false positives, but it does contain false negatives. As we will see later, these false negatives are not much of an issue for the performance of our foundation model.

Learning from LLM-Annotated Concepts

Ok, we now have a large and diverse annotated dataset, and we would like to use it to obtain a foundation model for entity recognition. How can we achieve that? The naive approach would be to train a classifier as we did before, by attaching a linear classifier on top of RoBERTa and fine-tuning the resulting network on our dataset. Unfortunately, this does not work well. The reason is that this dataset is quite peculiar: the number of concepts is large (80 thousand), many concepts are similar, and most concepts are only used a few times.

We find a simple solution to these issues. In the linear layer, instead of using a different weight vector $w_i$ for each concept, independent of other concepts, we compute $w_i = f(\text{concept}_i)$, where $\text{concept}_i$ is the concatenated string of the concept name and its description, and $f$ is a sentence encoder neural network:

Neural network computing the probability for each token to be part of a particular concept. This network is trained on our dataset to create the foundation model (i.e., the Tokens Encoder).

This allows the network to leverage similarities between concepts, and completely solves the issue of having numerous related concepts that are rarely seen. Such a setup could actually scale to many more unique concepts.
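The following toy sketch illustrates why computing the weights from the concept text helps with related concepts. The bag-of-words encoder is only a stand-in for the sentence encoder $f$ (the real one is a neural network), the concept descriptions are made up, and the sigmoid of a dot product is an assumed form for the output probability.

```python
import math


def build_vocab(texts):
    """Assign one index per distinct word (toy vocabulary)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def toy_sentence_encoder(text, vocab):
    """Hypothetical stand-in for f: an L2-normalized bag-of-words vector."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def token_probability(token_embedding, concept_weight):
    """Assumed output form: sigmoid of the token/concept dot product."""
    return 1.0 / (1.0 + math.exp(-dot(token_embedding, concept_weight)))


# Made-up descriptions; concept_i is the name concatenated with its description.
concepts = {
    "company": "a commercial business organization",
    "company name": "the name of a commercial business organization",
    "date": "specific points in time",
}
texts = {name: name + " " + desc for name, desc in concepts.items()}
vocab = build_vocab(texts.values())
weights = {name: toy_sentence_encoder(text, vocab) for name, text in texts.items()}
```

In this toy setup, the weight vectors of “company” and “company name” point in similar directions, while “date” is orthogonal to both. A training signal for one of the near-duplicate concepts therefore also benefits the other, instead of each rare concept having to learn its own independent weight vector.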

In practice, to learn more efficiently, we ignore concepts that are not present in a given batch of examples. For example, if we assume that only the sentence “World War II started on September 1 1939” is in a given batch, the network would only try to predict the presence of the concepts “Historical Event” and “Date” in this sentence:

$\begin{array}{|c|c|c|c|c|c|c|c|c|}\hline & \textit{World} & \textit{War} & \textit{II} & \textit{started} & \textit{on} & \textit{September} & \textit{1} & \textit{1939}\\\hline \textbf{Historical Event} & 1&1&1&0&0&0&0&0\\\hline \textbf{Date} & 0&0&0&0&0&1&1&1 \\\hline \end{array}$

That is, we do not learn from the “negatives” of other concepts. This produces drastically uncalibrated probabilities, but it does not matter because we only care about token embeddings. Similarly, the fact that the dataset contains many false negatives does not matter either: it only biases the probabilities computed and does not alter the quality of the embeddings much. In the end, this approach is quite similar to a contrastive learning procedure, and it is likely that such a procedure would obtain similar results.
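As an illustration of this batching scheme, the sketch below builds the binary targets for a batch while restricting the rows to the concepts that are annotated somewhere in that batch. The span representation (`start`, `end` exclusive, concept name) and the function name are assumptions.

```python
def batch_targets(batch):
    """batch: list of (tokens, spans); spans are (start, end, concept), end exclusive.

    Returns (concepts, targets) where concepts lists only the concepts present
    in this batch and targets[s][c][t] is 1 if token t of sentence s is
    annotated with concepts[c], else 0.
    """
    concepts = sorted({concept for _, spans in batch for _, _, concept in spans})
    targets = []
    for tokens, spans in batch:
        rows = {concept: [0] * len(tokens) for concept in concepts}
        for start, end, concept in spans:
            for t in range(start, end):
                rows[concept][t] = 1
        targets.append([rows[concept] for concept in concepts])
    return concepts, targets


tokens = "World War II started on September 1 1939".split()
spans = [(0, 3, "Historical Event"), (5, 8, "Date")]
concepts, targets = batch_targets([(tokens, spans)])
```

For this single-sentence batch, `targets[0]` contains the two rows of the table above (in alphabetical concept order). Every other concept in the dataset is simply absent from the loss for this batch.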

We use RoBERTa for both networks and fine-tune the last six layers on the dataset. We typically use batches of 32 examples. Once the training is done, we only keep the tokens-encoder network, which becomes our foundation model.

Results

Here is the transfer learning performance of our foundation model compared to previous models:

Transfer learning performance of RoBERTa base, RoBERTa base fine-tuned on NER Corpus, and RoBERTa base fine-tuned on our dataset.

This time the improvements are massive. This new foundation model is much better than the others, both in a few-shot setting and when the number of training examples increases. We have not reached the data regime where these models become equivalent. We see a difference of about 0.1 for the F1-score, but a better way to interpret these results is to look at data efficiency. For instance, about 30 examples per concept are needed to obtain an F1-score of 0.65 for the previous models, while only 5 examples per concept are needed for our model, which means a 6x improvement in data efficiency in this regime. The data efficiency seems to increase even further with the number of training examples, which is interesting. We would need more measurements to see how high it can go.

Ok, let’s now analyze the performance in more detail by examining each dataset individually. Here are the raw results:

$\begin{array}{|c|c|c|c|c|}\hline\textbf{Dataset} & \textbf{# Examples} & \textbf{RoBERTa} & \textbf{NER Corpus} & \textbf{Ours} \\\hline & 1 & 0.341 & 0.366 & \mathbf{0.419*} \\\text{BioNLP2004} & 4 & 0.445 & 0.504 & \mathbf{0.573*} \\ & 16 & 0.541 & 0.597 & \mathbf{0.664*} \\ & 64 & 0.589 & 0.622 & \mathbf{0.702*} \\\hline & 1 & 0.381 & 0.470 & \mathbf{0.478*} \\\text{MIT Movie} & 4 & 0.531 & 0.587 & \mathbf{0.631*} \\ & 16 & 0.611 & 0.645 & \mathbf{0.679*} \\ & 64 & 0.638 & 0.656 & \mathbf{0.687*} \\\hline & 1 & 0.407 & 0.376 & \mathbf{0.509*} \\\text{MIT Restaurant} & 4 & 0.650 & 0.646 & \mathbf{0.724*} \\ & 16 & 0.749 & 0.751 & \mathbf{0.793*} \\ & 64 & 0.791 & 0.779 & \mathbf{0.817*} \\\hline & 1 & 0.274 & 0.339 & \mathbf{0.398*} \\\text{OntoNotes 5.0} & 4 & 0.480 & 0.491 & \mathbf{0.612*} \\ & 16 & 0.615 & 0.575 & \mathbf{0.696*} \\ & 64 & 0.646 & 0.586 & \mathbf{0.706*} \\\hline\end{array}$

And here are the corresponding plots:

Transfer learning performance of RoBERTa base, RoBERTa base fine-tuned on NER Corpus, and RoBERTa base fine-tuned on our dataset. Each plot corresponds to a specific dataset.

Our foundation model is substantially superior to others across all datasets and data regimes. In the best scenario (BioNLP2004 or MIT Movie compared to RoBERTa), we observe a >10x improvement in data efficiency. In the least favorable scenario (on the MIT Restaurant dataset), we still see a 2x to 3x improvement.
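Data efficiency can be estimated from the raw results by asking how many examples our model needs to match the score a baseline reaches with a given number of examples. The sketch below does this for two of the datasets, using the values from the table above. Interpolating linearly in the logarithm of the number of examples is an assumption made here; it is a rough check, not the method used to produce the figures quoted in this post.

```python
import math

# Macro F1-scores copied from the raw results table, keyed by number of examples.
RESULTS = {
    "BioNLP2004": {
        "RoBERTa": {1: 0.341, 4: 0.445, 16: 0.541, 64: 0.589},
        "Ours": {1: 0.419, 4: 0.573, 16: 0.664, 64: 0.702},
    },
    "MIT Restaurant": {
        "RoBERTa": {1: 0.407, 4: 0.650, 16: 0.749, 64: 0.791},
        "Ours": {1: 0.509, 4: 0.724, 16: 0.793, 64: 0.817},
    },
}


def examples_needed(curve, target_f1):
    """Estimate where `curve` reaches target_f1 (linear in log2 of the count)."""
    points = sorted(curve.items())
    if target_f1 <= points[0][1]:
        return float(points[0][0])  # already reached at the smallest measured size
    for (n0, f0), (n1, f1) in zip(points, points[1:]):
        if f0 <= target_f1 <= f1:
            frac = (target_f1 - f0) / (f1 - f0)
            return 2 ** (math.log2(n0) + frac * (math.log2(n1) - math.log2(n0)))
    return None  # target not reached within the measured range


def data_efficiency(dataset, n_baseline, baseline="RoBERTa", model="Ours"):
    """Ratio of examples the baseline uses to examples `model` needs for the same F1."""
    target = RESULTS[dataset][baseline][n_baseline]
    needed = examples_needed(RESULTS[dataset][model], target)
    return None if needed is None else n_baseline / needed


bio = data_efficiency("BioNLP2004", 64)
restaurant = data_efficiency("MIT Restaurant", 16)
```

Under this interpolation, the ratio for BioNLP2004 at 64 baseline examples comes out above 10, and the ratio for MIT Restaurant at 16 baseline examples falls between 2 and 3.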

The magnitude of improvement compared to previous models came as a bit of a surprise to us, especially given the moderate gain obtained by training on the NER Corpus. We believe that the large number of diverse concepts, the domain-diversity of the dataset, and the specific training procedure all contributed to achieving such an improvement.

Let’s Put It to Work!

Overall, this foundation model constitutes an important improvement over comparable models, allowing for the training of accurate entity recognizers with substantially less annotated data than before. We believe this model is yet another step towards making information extraction a commodity, and with this in mind, we open-source it under the MIT license for everyone to use without restriction. Of course, the best way to use it is through NuMind, so don’t hesitate to reach out 🙂.

You can find the English model here, and our multilingual model here (which we will present in a later post).

Enjoy!