NuExtract 1.5 - Multilingual, Infinite context, still small, and better than GPT-4o!

Liam Cripwell
Machine Learning Scientist
Alexandre Constantin
Machine Learning Scientist
Etienne Bernard
Co-Founder & CEO
October 14, 2024
We introduce NuExtract 1.5, the new version of our foundation model for structured extraction. NuExtract 1.5 is multilingual, can handle arbitrarily long documents, and outperforms GPT-4o in English while being 500 times smaller. As usual, we release it under the MIT license.

Why NuExtract?

Before diving into the details of what's new, let's discuss what this is all about. NuExtract is a family of small open-source models that do only one thing: they extract information from documents and return a structured output (JSON). It turns out that, because they only do this one thing, they are very good at it. For example, we find that in our English zero-shot benchmark, NuExtract 1.5 (3.8B parameters) is better than GPT-4o while being 500 times smaller. Moreover, if you fine-tune NuExtract, you obtain performance that is hard to reach via prompting, even for a frontier LLM.

Such a small open-source model thus has two main advantages:

  1. You can use it privately, without sharing your data.
  2. You can fine-tune it on input-output examples to make it excel at a specific task.

If you need to get high extraction performance for a repetitive task, or if you need to process sensitive data, NuExtract is the way to go.

NuExtract 1.5 in a Nutshell

We have been pleasantly surprised by the reception of the first version of NuExtract (see, for example, this nice knowledge graph project via Haystack), and thus decided to push this project further. The two main requests we received were to give NuExtract:

  1. The ability to process long documents
  2. The ability to handle non-English documents

This is what we worked on - and essentially solved - for NuExtract 1.5.

In a nutshell, we created a multilingual dataset and trained the latest open-source LLMs on it. We then added an interesting "continuation" functionality, which effectively gives NuExtract an infinite context size with a bounded memory footprint. Let's now dive into the details…

Multilingual Abilities

One of the most common requests we received was to give NuExtract the ability to handle languages other than English. To achieve this, we need both a multilingual dataset and a multilingual foundation model. Fortunately, Phi-3.5 mini recently made a lot of progress on that front, now handling Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, and Ukrainian. We choose Phi-3.5 mini as the base of NuExtract 1.5.

For the training dataset, we need raw documents. We again take them from the C4 dataset, with a 50/50 split between English documents and documents in other languages (mainly French, German, Spanish, Italian, and Portuguese). In order for NuExtract to handle long documents properly, we also include longer documents than we did for the original NuExtract.

We need to annotate these documents, which means generating a template and an output for each document. This raises an important question: which language should the template be in? We choose to use an English template for half the documents, irrespective of their language, while we use the same language as the document for the other half. This allows users to create a single English template when they need to process documents in multiple languages. We then use the same automatic annotation procedure as we did for NuExtract. Here is an example of a French document with an English template:

Training example with a French document and an English template.
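To give a sense of what such a pair looks like, here is a purely illustrative example (a sketch of ours, not an actual pair from the training set): an English template, a French document, and an output whose values are copied verbatim from the document.

```python
# Purely illustrative (not taken from the actual training set): an English
# template paired with a French document; the output values are verbatim
# spans copied from the document.
template = {
    "Restaurant": {
        "Name": "",
        "City": "",
        "Dishes": [],
    }
}

document = (
    "Le restaurant La Bonne Table, situé à Lyon, propose un menu du jour "
    "composé d'une soupe à l'oignon et d'un bœuf bourguignon."
)

expected_output = {
    "Restaurant": {
        "Name": "La Bonne Table",
        "City": "Lyon",
        "Dishes": ["soupe à l'oignon", "bœuf bourguignon"],
    }
}
```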

Note that, as for the original NuExtract, this dataset is still purely extractive: we train the model to copy-paste parts of the document and not generate anything new. We intend to add abstraction/reformulation abilities in the next release.
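Because the dataset is purely extractive, every non-empty leaf value in an annotation should appear verbatim in the source document. A minimal sanity check along those lines (a sketch of ours, not the actual annotation pipeline) could look like this:

```python
def leaf_values(node):
    """Yield every non-empty string leaf of a nested JSON-like structure."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_values(item)
    elif isinstance(node, str) and node:
        yield node

def is_purely_extractive(document, output):
    """True if every extracted value is a verbatim span of the document."""
    return all(value in document for value in leaf_values(output))

# With the illustrative example above:
# is_purely_extractive(document, expected_output)  -> True
```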

Infinite Context

Thanks to the use of Phi-3.5 mini as the base model, NuExtract 1.5 now has a context size of 128k tokens (about 200 pages), which should be enough for the vast majority of applications. Nevertheless, there is still an issue: processing long documents with such a transformer model is memory- and compute-intensive, since every token needs to attend to every other token. Here is the GPU memory needed by NuExtract when processing a sequence of a given length:

Inference memory usage of NuExtract 1.5 as a function of the number of tokens in the document when the entire document is loaded in the context.

We can see that for sequences smaller than 10,000 tokens, the memory is dominated by the need to store the ~10GB model. Past 10,000 tokens, however, we enter a quadratic-scaling phase (to store the token-token attention scores). Maxing out the 128k-token context requires 1TB of GPU memory! This means that, for sequences smaller than 10,000 tokens, a standard GPU such as an L4 is fine to serve NuExtract, while longer sequences would require multiple expensive high-end GPUs.
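As a rough back-of-the-envelope check, the quadratic term alone already lands in the right ballpark. The assumptions below (32 attention heads, fp16 scores, and the full attention matrix materialized for one layer at a time) are ours, not a description of the actual inference stack:

```python
# Rough estimate of the memory needed for the quadratic attention-score term.
# Assumptions (ours, not a description of the actual inference stack):
# 32 attention heads, fp16 scores (2 bytes each), and the full n x n attention
# matrix materialized for a single layer at a time.
def attention_scores_gb(n_tokens, n_heads=32, bytes_per_score=2):
    return n_heads * n_tokens**2 * bytes_per_score / 1e9

print(f"{attention_scores_gb(10_000):.1f} GB")   # ~6 GB, on top of the ~10GB of weights
print(f"{attention_scores_gb(128_000):.0f} GB")  # ~1049 GB, i.e. about 1TB
```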

To solve this memory issue for long sequences, we adopt an original solution: we train NuExtract to extract information from a document while being given previously extracted information. To give NuExtract 1.5 this ability, we add new examples to the dataset in which previous information is provided, such as:

Example of continuation extraction. The output is obtained from the text, the template, and previously extracted information. Note that the temperature value gets overwritten here. (NB: this example is for illustration only and not part of the training set).

With such examples, the model should learn to merge previous and new information. This merging is not trivial; sometimes there is conflicting information. Note that in this case, the temperature value is overwritten as the new information is more relevant.

This "continuation" ability allows us to process arbitrarily long documents by iteratively re-injecting the current state of information while processing text via a sliding context windowā€” reminiscent of recurrent neural networks. The nice part of this procedure is that the memory footprint is bounded by the window size. Here is the memory requirement for an extraction window of 10k, assuming a constant output size of about 2k tokens:

Comparison of GPU memory requirements for using NuExtract with a full extraction window and with a 10k tokens extraction window with a 2k tokens output.

We see that the memory is now less than 30GB, irrespective of the document size.
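To make the procedure concrete, here is a minimal sketch of the continuation loop; the `extract` function and its arguments are placeholders of ours, not the actual NuExtract API:

```python
def extract(text, template, previous):
    """Placeholder for one NuExtract call: given a chunk of text, the JSON
    template, and previously extracted information, it returns the merged
    JSON output. (Not the real API; plug in your own model call here.)"""
    raise NotImplementedError

def sliding_window_extract(document_tokens, template, window=10_000, stride=8_000):
    """Sketch of the continuation procedure: slide a window over the document
    and re-inject the current state of extracted information at each step,
    so the model can merge it with what it finds in the new chunk."""
    state = None  # no previous information before the first window
    for start in range(0, len(document_tokens), stride):
        chunk = document_tokens[start:start + window]
        state = extract(text=chunk, template=template, previous=state)
        if start + window >= len(document_tokens):
            break  # the last window reached the end of the document
    return state
```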

The downside of this strategy is that it requires generating the output several times, and that performance can degrade if the sliding window is too small (see the results section). Also, it only works if the output is much smaller than the document, which is usually the case for long documents.

Training & Results

We train Phi-3.5 mini (3.8B) on our dataset to obtain NuExtract 1.5. We also try to train a 0.5B model on this dataset, but it turns out that such a model is too small to be both multilingual and have continuation abilities. We resort to training Qwen 2.5 0.5B only on English documents and without continuation examples, which gives us NuExtract 1.5 tiny.

English Performance

Let's first look at the performance of the trained model on our English benchmark. This benchmark is composed of 600 examples from 12 extraction problems spanning a variety of use cases. It is still an experimental benchmark at this stage, but it is already useful for comparing models (we plan to release it publicly when it is complete). Note that this benchmark also tests abstraction abilities that NuExtract does not have yet.
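The exact scoring of the benchmark is not detailed here, but to give an idea of what a field-level F1-score can look like, here is one common recipe (an illustrative sketch of ours, not the benchmark's definition): flatten the predicted and reference JSON trees into (path, value) pairs and compute precision, recall, and F1 over them.

```python
# Illustrative field-level F1 sketch, not the benchmark's actual metric.
def flatten(node, path=""):
    """Flatten a nested JSON structure into a set of (path, value) pairs."""
    pairs = set()
    if isinstance(node, dict):
        for key, value in node.items():
            pairs |= flatten(value, f"{path}/{key}")
    elif isinstance(node, list):
        for item in node:
            pairs |= flatten(item, path)
    elif node not in (None, ""):
        pairs.add((path, str(node)))
    return pairs

def f1_score(prediction, reference):
    """F1 over the (path, value) pairs of predicted vs. reference JSON."""
    pred, ref = flatten(prediction), flatten(reference)
    if not pred and not ref:
        return 1.0
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```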

Here are the zero-shot results:

Zero-shot results on the structured extraction benchmark. Average F1-score over extraction problems. NuExtract 1.5 is better than NuExtract and slightly better than GPT-4o.

We can see that NuExtract 1.5 is substantially better than the original NuExtract. NuExtract 1.5 is even a bit better than GPT-4o!

Let's now look at the results when models are given access to input-output examples. We use the same benchmark as before and fine-tune NuExtract 1.5 on 45 examples for each of the 12 problems. We also benchmark GPT-4o by putting all 45 examples in the prompt (a.k.a. in-context learning), which is only possible because our benchmark examples are short (typically 1k tokens), resulting in prompts of about 50k tokens:

Many-shot results on the structured extraction benchmark. GPT-4o is slightly better than NuExtract 1.5. Large improvement between NuExtract 1.5 and NuExtract 1.5 tiny.

As expected, all models drastically improve their performance (hatched areas). We can see that GPT-4o is now better than NuExtract 1.5, but not by much. The other important thing to note is that NuExtract 1.5 is much better than NuExtract 1.5 tiny, which hints that a bigger NuExtract could largely beat GPT-4o. To be confirmed…

Overall, NuExtract 1.5 and GPT-4o show very similar performance in both the zero-shot and many-shot regimes. It may be surprising that a model 500 times smaller, with no abstraction abilities, rivals such a powerful frontier model. We believe there are three reasons for this. First, by focusing only on the task of structured extraction, NuExtract can reallocate some of its weights to improving text understanding. Second, the training procedure is good at forcing NuExtract to follow the template precisely and to return only a JSON output. Last but not least, our training drastically reduces hallucinations by forcing the model to extract parts of the input text and training it to return empty results when necessary.

Multilingual Performance

Let's now see the performance on the multilingual benchmark (which is composed of 250 documents per language, translated from part of the English benchmark):

Multilingual zero-shot results on the structured extraction benchmark. Average F1-score over extraction problems and languages. NuExtract 1.5 is better than NuExtract but still not at GPT-4o levels.

We see that NuExtract 1.5 is much better than the original NuExtract; however, GPT-4o is still better in this case. We believe that model size is quite important for multilinguality (which is confirmed by the fact that we could not train NuExtract 1.5 tiny to be multilingual). We expect to close this gap with a bigger NuExtract.

Long Documents Performance

Finally, let's look at the performance on long documents. We first test documents in the 8k-10k token range (around 20 pages) because we can easily process them without a sliding window:

Performance on long documents (between 8k and 10k tokens). NuExtract 1.5 is much better than NuExtract 1.5 tiny and beats GPT-4o!

The results are impressively good: NuExtract 1.5 is better than GPT-4o! We should note that the benchmark in this regime is not as complete and diverse as in the case of smaller documents, but still, it shows that NuExtract 1.5 is very good at handling long documents (and it is a testament to the proper handling of long contexts by Phi-3.5 mini). We also see that NuExtract 1.5 tiny is substantially worse than NuExtract 1.5; we are not sure at this point whether this is simply due to the model size or to the base model used.

Now we test even longer documents, in the 10k-20k tokens range. This time we have to set a 10k extraction window to keep memory manageable:

Performance on even longer documents (between 10k and 20k tokens). NuExtract 1.5 beats GPT-4o while only using a 10k extraction window!

Again, NuExtract 1.5 is the top-performing model even with the reduced extraction window, showing that the previous result is not a fluke. It also shows that, at least for a window size of 10k tokens, the continuation strategy works well.

Let's now analyze the performance as a function of the size of the extraction window. We again use the 8k-10k tokens benchmark:

Performance of NuExtract on long documents (8k-10k tokens) as a function of the size of the extraction window. We need to go down to a 2k tokens window for NuExtract 1.5 to become worse than GPT-4o.

We can see that the performance of NuExtract 1.5 decreases as the size of the extraction window decreases, but not by much! We need to go down to a 2k tokens window for NuExtract 1.5 to become worse than GPT-4o, and it is still much better than NuExtract 1.5 tiny. Using a small window reduces memory: 20GB for the full window versus 10GB (most of which is the model's weights) for the 2k window. The ratio becomes much bigger for longer sequences.

Using such a continuation procedure is not perfect (and there are certainly ways to improve it), but it avoids simply failing when the required memory exceeds the GPU's memory. Our inference module (part of our Enterprise solution, talk to us 😊) automatically adapts the window size to a given GPU's memory.
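For illustration only (this is not NuMind's inference module), picking a window size for a given memory budget can be as simple as reusing the rough attention-score estimate from earlier and taking the largest window that fits:

```python
# Illustrative sketch, not NuMind's inference module: choose the largest
# extraction window whose estimated memory fits the GPU budget, using the
# same rough per-layer attention-score estimate as earlier.
def estimated_memory_gb(window_tokens, n_heads=32, model_gb=10.0):
    # model weights + fp16 attention scores for one layer at this window size
    return model_gb + n_heads * window_tokens**2 * 2 / 1e9

def pick_window_size(gpu_memory_gb,
                     candidates=(128_000, 64_000, 32_000, 16_000, 10_000, 4_000, 2_000)):
    for window in candidates:  # candidates listed from largest to smallest
        if estimated_memory_gb(window) <= gpu_memory_gb:
            return window
    return min(candidates)

print(pick_window_size(24))  # -> 10000 on a 24GB GPU (under these assumptions)
print(pick_window_size(80))  # -> 32000 on an 80GB GPU
```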

Let's Use It!

That's it for this release. You can try NuExtract 1.5 here. We hope that you will make good use of it. Do not hesitate to give us feedback to help us improve the next versions :)
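If you prefer to run it locally, here is a minimal usage sketch with Hugging Face transformers. The repository id and the exact prompt layout below are assumptions on our part; check the NuExtract 1.5 model card for the authoritative version.

```python
# Minimal usage sketch. The model id and prompt layout are assumptions;
# see the NuExtract 1.5 model card on Hugging Face for the exact format.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "numind/NuExtract-1.5"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda").eval()

template = json.dumps(
    {"Model": {"Name": "", "Number of parameters": "", "Context size": ""}}, indent=4
)
text = "NuExtract 1.5 is a 3.8B-parameter structured-extraction model with a 128k-token context."

# Assumed prompt layout: template first, then the text to extract from.
prompt = f"<|input|>\n### Template:\n{template}\n### Text:\n{text}\n<|output|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```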
