NuExtract: A Foundation Model for Structured Extraction

Alexandre Constantin, Machine Learning Scientist
Liam Cripwell, Machine Learning Scientist
Etienne Bernard, Co-Founder & CEO
June 24, 2024
We introduce NuExtract, a lightweight text-to-JSON LLM. NuExtract extracts arbitrarily complex information from text and turns it into structured data. This model can be used directly in a zero-shot setting or fine-tuned to solve a specific extraction problem. As usual, we open-source it under the MIT license for everyone to use.

TLDR

We trained language models ranging from 0.5B to 7B parameters on an LLM-generated structured-extraction dataset. The resulting models - NuExtract-tiny, NuExtract, and NuExtract-large - achieve extraction performance similar to or higher than that of popular LLMs 100 times their size. You can try NuExtract here. Talk to us to get even higher performance 🙂.

Comparison of NuExtract models with popular generic LLMs in the zero-shot setting. NuExtract-large is at GPT-4o levels while being at least 100 times smaller.

Structured Extraction

Structured Extraction is the most general and versatile information extraction task. Its goal is to extract all kinds of information from a document - entities, quantities, dates, and so on - and to identify their (potentially hierarchical) relationships. The extracted information is then structured in the form of a tree, which usually follows a template (a.k.a. schema) so that it can easily be parsed to populate a database or used directly to take automatic actions. This extraction tree is almost always expressed in JSON. Here is a toy example:

Structured extraction toy example. There are 5 kinds of entities and two kinds of relations organized in a tree of depth 4.

Even this simple example shows the complexity of this task compared to traditional NLP tasks. There are 5 kinds of entities and two kinds of relations (name <> purchases and type <> quantity <> cost) organized in a tree of depth 4. It would be a headache to tackle this relatively simple problem via traditional information extraction methods. Also, while this toy example only involves a paragraph and a rather shallow tree, structured extraction can involve multi-page documents and deeper trees, which are quite challenging to handle even for modern LLMs.

At NuMind, we encountered two kinds of applications for structured extraction. The first one is fairly classic - but still rather unsolved - and consists of parsing technical documents such as medical reports, legal documents, or financial reports. One increasingly common reason to parse such documents is to create knowledge bases that power RAG solutions.

The second kind of application deals with chatbot conversations, for instance to order groceries, book a train ticket, or replace traditional forms. In each scenario, the correct information must be extracted for the conversational agents to make the appropriate API calls in real time.

Overall, structured extraction can be used to address just about any data extraction problem. In a sense, this is the holy grail of information extraction.

Using GPT-4

Historically, structured extraction was only tackled via regex or non-generative machine learning models. This limited it to simple cases such as extracting entities. Thanks to modern generative LLMs, we can now go much further and generate deep extraction trees.

Let’s see for example how we can use one of the best current LLMs, GPT-4, to parse the description of a chemical reaction:

Parsing of a chemical reaction from its description. Data provided by Iktos.ai.

In this case, the chemical substances need to be identified, classified, and associated with their respective quantities. The durations and temperatures of the reaction also need to be extracted. We can do this with the following prompt:

Given the following JSON template and text, return a version of the JSON template filled in with the relevant data. Don't return anything besides the filled in JSON content.

{
  "reactants" : [{"name" : "", "quantity" : ""}],
  "reagents" : [{"name" : "", "quantity" : ""}],
  "solvents" : [{"name" : "", "quantity" : ""}],
  "catalysts" : [{"name" : "", "quantity" : ""}],
  "time" : [""],
  "temperature" : [""]
}

Input: *<input>*

Output:

Here we simply defined the template/schema by providing a sort of empty JSON output example. Here is what GPT-4 returns for our chemical reaction:

We can see that it did a good job of extracting the chemical substances and their quantities; however, both reagents are misclassified.
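
As a side note, such a call can be scripted directly against the OpenAI API. The sketch below is only illustrative (the prompt wording follows the example above, and error handling is omitted):

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TEMPLATE = """
{
  "reactants" : [{"name" : "", "quantity" : ""}],
  "reagents" : [{"name" : "", "quantity" : ""}],
  "solvents" : [{"name" : "", "quantity" : ""}],
  "catalysts" : [{"name" : "", "quantity" : ""}],
  "time" : [""],
  "temperature" : [""]
}
"""

def extract(text: str) -> dict:
    prompt = (
        "Given the following JSON template and text, return a version of the JSON "
        "template filled in with the relevant data. Don't return anything besides "
        f"the filled in JSON content.\n{TEMPLATE}\nInput: {text}\n\nOutput:"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic extraction
    )
    # The model is instructed to return only JSON, so we parse the reply directly.
    return json.loads(response.choices[0].message.content)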

To improve performance, we could add examples to the prompt. However, such “in-context learning” quickly saturates, as we found out for entity recognition in our recent paper introducing NuNER:

Average performance of various NER models as a function of training size. Dashed lines represent in-context learning; solid lines represent fine-tuning. In-context learning quickly saturates.

Additionally, GPT-4 is a massive model, expensive to use, and requires data to be shared. To solve all these issues, we need a compact task-specific foundation model.

Task-Specific Foundation Models

A task-specific foundation model is a model specialized for a generic task - such as sentiment analysis or entity recognition - but agnostic to the data domain and the specific problem to solve. Such models have the advantage of being small, usable in a private setting, and often better at the task than much larger generic foundation models.

To create a task-specific foundation model, we first need to take a diverse corpus, such as C4, and annotate it using a modern LLM with a proper prompt. The annotations are not perfect, but that's okay. We then need to fine-tune a compact generic foundation model on this partially synthetic data to obtain the task-specific model. Here is the procedure in the case of NuExtract:

NuExtract creation procedure. A generic small language model (Phi-3) is fine-tuned on synthetic data generated by an LLM (Llama 3) to obtain a model specialized in the task of structured extraction. NuExtract can be used in a zero-shot setting, in a few-shot setting, or fine-tuned for a specific application.

The resulting model can then be used in a zero-shot setting or fine-tuned to solve a specific problem, which it will solve better than a large generic model would.

Template/Schema Representation

We would like our model to work in a zero-shot setting, which in this context means that it should be able to extract information from a text solely based on a template/schema. We choose to represent the schema by a sort of empty JSON such as:

{
  "reactants" : [{"name" : "", "quantity" : ""}],
  "time" : [""]
}

Each array contains an element template, and empty strings mark the fields to extract. Note that we only output strings and ignore other JSON types, as we don’t see much value in supporting them (you can always return a number as a string). We use this template format because of its simplicity.
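
For illustration, here is a small (hypothetical) helper that checks that a template follows this format, i.e., that every leaf is an empty string and every array contains a single element template:

def is_valid_template(node) -> bool:
    """Check that a template only contains objects, single-element arrays, and empty-string leaves."""
    if isinstance(node, dict):
        return all(is_valid_template(value) for value in node.values())
    if isinstance(node, list):
        # An array holds exactly one element template describing its items.
        return len(node) == 1 and is_valid_template(node[0])
    return node == ""  # leaves mark the fields to extract

template = {"reactants": [{"name": "", "quantity": ""}], "time": [""]}
assert is_valid_template(template)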

Note that this template format does not allow for field descriptions. This is because we believe that examples are more informative than descriptions. As we will see below, NuExtract is trained to work both in a zero-shot setting and in a pseudo few-shot setting (i.e., with a few output examples added to the prompt).

Dataset Creation

First, we need to figure out the kind of text we want to train NuExtract on. Currently, the main needs for structured extraction are in the medical, legal, and financial domains. However, we believe that this task will benefit many more domains, so we aim for NuExtract to be as domain-agnostic as possible.

Generating such diverse text via LLMs does not give good results. Instead, we use 300k English pieces of text from the C4 dataset, a large and diverse general-domain dataset (as we did for NuNER). The idea here is that we will find something interesting to extract in most texts.
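
Collecting such a corpus is straightforward with the Hugging Face datasets library. Here is a rough sketch that streams the English split of C4 and keeps the first 300k texts; the 1,200-word cap is an assumption based on the length distribution shown later, not our exact sampling code:

from datasets import load_dataset

# Stream the English split of C4 to avoid downloading the full corpus.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

texts = []
for example in c4:
    # Keep relatively short texts to limit context-length issues during training.
    if len(example["text"].split()) <= 1200:
        texts.append(example["text"])
    if len(texts) >= 300_000:
        break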

To annotate this text, we first prompt an LLM to generate a template from each piece of text. Here is the prompt we use:

!!!START Context!!!

*<text-to-annotate>*

!!!END Context!!!

Goal: Generate an information extraction dataset.

Input: Text document + instructions for annotation.

Output: 1 JSON object (schema).

Schema:
Describes the information to be extracted.
Each field should:
Be a clear and concise name representing the extracted data.
ONLY STRING TYPE ARE ALLOWED AS VALUES (it can be an array of strings, or an object with string values, or an array of objects with string values...).
NO BOOLEAN, INT, ENUM, ETC.
The schema can focus only on part of the context document, or on the whole document.

Constraints:
Extracted information should be thematically coherent and form a well-structured JSON schema with a clear relationship between fields.

*<few-shot examples>*

Note the presence of few-shot “text → template” examples, which are taken from a set created by hand.

Once we have the templates, we can use the LLM to extract information according to each template. For half of the examples, we extract information from the full text, but for the other half, we remove part of the text. Removing part of the text (but keeping the original template) creates empty fields in the output, and will teach the model that it is acceptable to return an empty string when the information is not present. This form of negative sampling is a way to fight hallucinations. Here is the prompt that we use to extract information:

!!!START Context!!!

*<text-to-annotate>*

!!!END Context!!!

Goal: Extract strings from the text corresponding to the given schema.

Input: Text document + schema.

Output: 1 JSON object

Schema:
The schema describes the information to be extracted.
ONLY STRING TYPE ARE ALLOWED AS VALUES (it can be an array of strings, or an object with string values, or an array of objects with string values...).
NO BOOLEAN, INT, ENUM, ETC.
The schema can focus only on part of the context document, or on the whole document.

Output:
THE OUTPUT SHOULD FOLLOW EXACTLY THE SCHEMA.
It should respect the schema and contain the extracted information from the context document.
THE STRING SHOULD BE PRESENT EXACTLY AS IT IS IN THE CONTEXT DOCUMENT. NO PARAPHRASING ALLOWED.
If the information is NOT PRESENT in the context, return "" for empty string and [] for empty array. If the list of object is empty, return [].
Return only the information extracted as JSON. Do not output anything else or says anything else.

Information to extract:

*<schema>*

Note that this prompt pushes the LLM to only extract strings from the text (i.e., copy-pasting), and discourages the generation of original values. This is a tradeoff that we had to make to fight hallucinations.
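
This constraint also makes the annotations easy to verify automatically. Here is a sketch of the kind of check that can be used to filter out unfaithful annotations (as we do below); it is a hypothetical helper, not our exact filtering code:

def leaf_values(node):
    """Yield all string leaves of an extraction tree."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_values(item)
    elif isinstance(node, str):
        yield node

def is_faithful(extraction: dict, text: str) -> bool:
    """True if every non-empty extracted value appears verbatim in the source text."""
    return all(value in text for value in leaf_values(extraction) if value != "")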

We use this prompt with Llama 3 70B to annotate the 300k pieces of text, and then filter out examples for which the template is not followed, as well as examples for which extracted values are not found in the text. This results in 50k annotated examples. Here is one of these examples:

Typical example from C4 annotated by Llama 3 70B. 163 words, extraction depth of 5. We can see that some information is missing.

This example is typical: it has a word count of 163, an extraction depth of 5, and contains missing fields. Some documents are longer; here is the word-count distribution:

Distribution of the number of words across all text pieces. Most text pieces lie in the 0-200 words range.

We can see that most pieces of text are below 200 words, but there is a tail going up to 1,200 words, which typically corresponds to a 2-3 page document. We deliberately restricted the text length to avoid running into context-length issues (especially for NuExtract-tiny); training on longer texts will be a future improvement. Let’s now look at the extraction-tree depth distribution:

Distribution of extraction-tree depths across all examples. We can see that most trees have a depth of 3, 4, or 5. Classification & NER have a depth of 1, and relation extraction has a depth of 2.

We can see that most extraction trees have a depth of 3, 4, or 5, but some even reach a depth of 9! To put this into perspective, classification and entity recognition only have a depth of 1, and relation extraction has a depth of 2.
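
For reference, the depth of an extraction tree can be computed with a short recursive function. This is only a sketch, and the counting convention (here, string leaves count as depth 1) may differ slightly from the one used for the figure:

def tree_depth(node) -> int:
    """Maximum nesting depth of an extraction tree; string leaves count as depth 1."""
    if isinstance(node, (dict, list)):
        children = node.values() if isinstance(node, dict) else node
        return 1 + max((tree_depth(child) for child in children), default=1)
    return 1

toy_template = {"reactants": [{"name": "", "quantity": ""}], "time": [""]}
print(tree_depth(toy_template))  # 4 under this convention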

Let’s now look at the diversity of the extracted information. We find that the LLM extracted more than 200k unique field names. Here is a word cloud for the 100 most common fields:

Word cloud of the 100 most common field names found by Llama 3 70B.

We can see that the fields “Description”, “Name”, “Location”, and “Type” dominate - which is not surprising - along with less common fields that also make sense, such as “Ingredients” or “Artist”. To better analyze the coverage of this dataset, let’s look at a feature map of the top-10k fields:

Feature map of the 10k most common field names. We average GloVe word embeddings of field names and use UMAP to reduce dimensions (2 dimensions for position, 3 for color).

We see a mix of generic concepts - such as dates on the top right, contact information on the top left, or dimensions on the bottom - as well as industry-specific concepts - such as nutrition on the bottom left or health on the top left. This dataset seems to have the concept diversity that we need to train NuExtract.
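
A feature map like the one above can be reproduced roughly as follows. This sketch assumes the gensim GloVe vectors and the umap-learn package, and should be run on the full list of field names rather than the short illustrative one used here:

import numpy as np
import gensim.downloader
import umap

glove = gensim.downloader.load("glove-wiki-gigaword-300")  # pre-trained GloVe word vectors

def embed(field_name: str) -> np.ndarray:
    """Average the GloVe embeddings of the words composing a field name."""
    words = [w for w in field_name.lower().replace("_", " ").split() if w in glove]
    return np.mean([glove[w] for w in words], axis=0) if words else np.zeros(300)

field_names = ["description", "name", "location", "ingredients"]  # top-10k fields in practice
embeddings = np.stack([embed(name) for name in field_names])

# 2D projection for the positions on the map, 3D projection mapped to RGB for the colors.
n_neighbors = min(15, len(field_names) - 1)  # guard for the tiny illustrative list
positions = umap.UMAP(n_components=2, n_neighbors=n_neighbors).fit_transform(embeddings)
colors = umap.UMAP(n_components=3, n_neighbors=n_neighbors).fit_transform(embeddings)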

This dataset is all we need to create a zero-shot model which takes a template and a text as input and returns the filled template. However, we want to do better. Indeed, our templates are rather limited: they solely rely on the field names to define the task. One way to alleviate this is to provide field descriptions or even examples of field values. The other - and much better - way is to give full input → output examples of the task in the prompt. The issue is that input texts can be long, which takes up valuable context space. We found that, surprisingly, only providing the outputs works great. This is a sort of hybrid few-shot setting.

We generate these output examples with the LLM from the outputs and templates, and include from 0 to 3 of them in each training example. This means that NuExtract can be used either in pure zero-shot mode, or with a few output examples added after the template.
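
Concretely, a training (or inference) prompt can then contain the template, zero to three output-only examples, and the text. The exact markers NuExtract expects are described on its model card; the ones below are only illustrative:

import json

def build_prompt(template: dict, text: str, output_examples=()) -> str:
    """Assemble a pseudo few-shot prompt: template, 0-3 output-only examples, then the text."""
    parts = ["### Template:", json.dumps(template, indent=2)]
    for example in output_examples:
        # Output-only examples: filled templates shown without their source texts.
        parts += ["### Example:", json.dumps(example, indent=2)]
    parts += ["### Text:", text]
    return "\n".join(parts)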

That’s it, our dataset is ready. Let’s train models on it!

Base Models

Usual information extraction tasks - such as classification, entity recognition, or relation extraction - have a relatively simple output space. This simplicity allows the use of an encoder model (like a bi-directional transformer) to explicitly model output probabilities. However, in the case of structured extraction, the output space is large and complex, so we need to generate the output as we would generate text. We can use either an encoder-decoder architecture, like T5, or a pure decoder, like a generative LLM.

An encoder-decoder architecture is likely the best choice for this task. However, these models have not been trained as extensively as recent generative LLMs. As a result, we opt to use pure decoder LLMs. We use Phi-3-mini (3.8B parameters) for NuExtract, Phi-3-small (7B parameters) for NuExtract-large, and Qwen1.5-0.5B (0.5B parameters) for NuExtract-tiny. We fine-tune these base models on our dataset.
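
Fine-tuning itself is standard supervised fine-tuning of a causal LLM. Here is a minimal sketch using Hugging Face TRL; the hyperparameters and the prompt formatting are assumptions, not our exact training recipe:

import json
from datasets import Dataset
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model_name = "Qwen/Qwen1.5-0.5B"  # base model behind NuExtract-tiny
model = AutoModelForCausalLM.from_pretrained(model_name)

annotated_examples = [  # stand-in for the 50k LLM-annotated (template, text, output) triples
    ({"name": "", "purchases": [{"type": "", "quantity": ""}]},
     "Alice bought two apples.",
     {"name": "Alice", "purchases": [{"type": "apples", "quantity": "two"}]}),
]

# Each training example is a full prompt (template + text) followed by the target JSON.
train_data = Dataset.from_list([
    {"text": f"### Template:\n{json.dumps(t)}\n### Text:\n{doc}\n### Output:\n{json.dumps(out)}"}
    for t, doc, out in annotated_examples
])

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    args=SFTConfig(output_dir="nuextract-tiny", num_train_epochs=3,
                   per_device_train_batch_size=8, dataset_text_field="text"),
)
trainer.train()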

Evaluation

As a first test, let’s try NuExtract on our toy example:

NuExtract result on toy example. All values are correct except for one that needed generation.

The result is pretty encouraging: the JSON is valid, the schema is respected, and all extracted values are correct except for one! Interestingly, the only incorrect value - the number of bikes - is the one that couldn't have been copy-pasted from the text. This makes sense as we added a “pure-extraction” constraint to fight hallucinations. Future versions of NuExtract will likely relax this constraint.

The behavior of NuExtract on this toy example is typical. We find that it always produces valid JSON expressions and has no difficulty following the template. This is a good sign and means that guided generation is not necessary with this model, which simplifies its deployment.
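
In practice, running NuExtract looks like running any causal LLM with the transformers library. The sketch below assumes the models are published on the Hugging Face Hub under the numind organization; the exact prompt format to use is described on the model card, and the markers below are only indicative:

import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "numind/NuExtract"  # or "numind/NuExtract-tiny", "numind/NuExtract-large"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

template = {"reactants": [{"name": "", "quantity": ""}], "time": [""], "temperature": [""]}
text = "..."  # the chemical reaction description

prompt = f"### Template:\n{json.dumps(template, indent=2)}\n### Text:\n{text}\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(json.loads(completion))  # valid JSON following the template, no guided decoding needed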

Ok, let’s now perform a more comprehensive assessment of the model’s performance. For this, we need a benchmark. Unfortunately, while there are many public benchmarks for classic information extraction tasks such as entity recognition, there aren’t any for the full structured-extraction task. We need to create one.

To create this benchmark, we select a set of “problems” that we think are interesting, such as parsing resumes. For each problem, we create a template, find a set of raw texts, and manually extract information from these texts. This benchmark allows us to test both the zero-shot abilities and the learning abilities of models. At the time of writing this blog post, the benchmark is not finalized - we currently only have a few hundred examples from a handful of problems - so the following results should be considered indicative. We intend to make this benchmark public once it is finalized.

We now need a metric. Classic JSON distances such as tree edit distance are not well suited to our case because we want to heavily penalize the model when the schema is not respected, and we do not want to penalize the model when array elements are permuted. We ended up creating a simple tree-matching method that aligns extracted values (the leaves of the tree) through a recursive process, computes the similarity between corresponding values through exact matching, and averages these leaf similarities to obtain a score between 0 (trees are completely different) and 1 (trees are a perfect match). We will properly define this metric when the benchmark is made public.
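
To make this concrete, here is a simplified sketch of such a tree-matching score. It uses a greedy alignment of array elements and exact string matching at the leaves; the actual benchmark metric will be defined precisely when the benchmark is released:

def leaf_scores(pred, ref):
    """Compare a predicted tree to a reference tree, returning one score per reference leaf."""
    if isinstance(ref, dict):
        pred = pred if isinstance(pred, dict) else {}  # schema not respected: leaves score 0
        scores = []
        for key, ref_child in ref.items():
            scores += leaf_scores(pred.get(key), ref_child)
        return scores
    if isinstance(ref, list):
        candidates = list(pred) if isinstance(pred, list) else []
        scores = []
        for ref_item in ref:  # greedy alignment makes the score insensitive to permutations
            options = [leaf_scores(c, ref_item) for c in candidates] or [leaf_scores(None, ref_item)]
            best = max(range(len(options)), key=lambda i: sum(options[i]))
            scores += options[best]
            if candidates:
                candidates.pop(best)
        return scores
    return [1.0 if pred == ref else 0.0]  # leaf: exact string match

def tree_match(pred, ref) -> float:
    """Average leaf similarity, from 0 (completely different) to 1 (perfect match)."""
    scores = leaf_scores(pred, ref)
    return sum(scores) / len(scores) if scores else 1.0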

Here are the zero-shot results that we obtain for NuExtract models compared to popular - and much larger - generic LLMs:

Comparison of NuExtract models with popular generic LLMs in the zero-shot setting. NuExtract-large is at GPT-4o levels while being at least 100 times smaller.

We see that NuExtract-tiny is better than GPT-3.5 while being at least 100 times smaller, that NuExtract outperforms Llama 3 70B while being 35 times smaller, and that NuExtract-large reaches GPT-4o levels while being at least 100 times smaller.

Having small language models perform as well as or better than LLMs 100 times their size offers several benefits. The first, of course, is lower inference cost. The second is the possibility to run these small language models locally, and thus privately. The last is that these models are easy to fine-tune to achieve even higher performance.

Let’s now see how much we gain by fine-tuning NuExtract models on the chemistry problem introduced before. We perform a 5-fold cross validation on the 50 examples of the problem. Here are the results:

Comparison of NuExtract models with popular generic LLMs on the chemical extraction problem (data provided by Iktos.ai). Non-hatched areas indicate zero-shot performance. Hatched areas indicate the performance increase after fine-tuning on 40 examples. Fine-tuned NuExtract models substantially outperform zero-shot GPT-4o while being at least 100 times smaller.

We can see that NuExtract-tiny, despite having only 0.5B parameters, is already slightly better than GPT-4o, and that NuExtract and NuExtract-large are now on a different level entirely. We could also compare with a fine-tuned Llama 3 70B (GPT-4o does not allow fine-tuning), but this is not a trivial task and requires good GPUs - both for training the model and for serving the resulting fine-tuned model. These results show the benefits of using small language models fine-tuned to solve structured extraction problems.

Let’s Use It!

Structured extraction is one of the main use cases of modern LLMs. NuExtract performs this task at a similar - or even higher - level than the largest LLMs while being orders of magnitude cheaper to use. We hope that this model will be as useful as possible and release it under the MIT license for everyone to use. Of course, if you would like to obtain even higher performance, the best way is to use this model through NuMind - don’t hesitate to contact us about it 🙂.
