Information extraction from 2D documents: a hybrid approach
Olivier Nguyen
March 25

Document processing is a critical task across virtually every industry. At Element AI, we’re developing new tools to accelerate and even automate this process. We leverage state-of-the-art deep learning techniques to build systems that can precisely and rapidly extract information from digital and scanned documents. This requires models that have the ability to understand 2D layout and the semantics of text.

The heterogeneity inherent in this type of data poses many challenges. Documents differ in many ways, and the same information is often expressed in different language. Document layout can vary radically, even for a specific document type in an organization, and can change unpredictably over time. Understanding what data needs to be extracted can require inference across multiple structural elements, combining information gleaned from, for example, header sections, column labels and row items. Defining a generic modelling approach capable of performing well across a wide range of document types and industries is therefore a daunting task.

Existing approaches

Traditional approaches in natural language processing (NLP) have treated this problem as one of named entity recognition (NER), where the task is to assign spans of text to a specific category, such as a name or an address. Some of these techniques have even been implemented in production systems.
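As a rough illustration of what this looks like in practice (using the open-source spaCy library rather than any production pipeline), a standard NER model tags flat spans of text with categories:

```python
# Illustrative only: off-the-shelf NER with spaCy, tagging flat spans of text.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Invoice from Acme Corp, 123 Main Street, Springfield, due March 25.")

for ent in doc.ents:
    # Each entity is a contiguous span of text with a single label,
    # e.g. ORG, GPE or DATE -- with no notion of where it sat on the page.
    print(ent.text, ent.label_)
```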

However, NER ignores the spatial organization of text, which is often meaningful in a form or other structured document. These systems frequently break as documents become more complex and require more reasoning about global context.

Pure computer vision (CV) methods, on the other hand, try to exploit the 2D structure of documents by processing them as images, using raw pixel values and applying detection or segmentation models to assign image pixels to fields, as shown by Chen et al. and Olivera et al.

Recent approaches that combine modern NLP and CV techniques for document extraction, such as BERTGrid and Attend, Copy & Parse, as well as methods based on graph convolutions, have shown promising results in academic contexts. However, these approaches require large amounts of labelled data, are slow to train and are not well suited to a production system that needs to adapt quickly to many domains and to new types of documents and layouts. While they may work for domains like receipts, where large amounts of labelled data can easily be obtained, they are difficult to employ in scenarios with high variability and sensitive data, such as insurance forms, bills of lading or business reports.

Element AI Document Intelligence employs a hybrid approach using both deep learning and classical machine learning techniques for entity extraction. Combining state-of-the-art deep learning optical character recognition (OCR) models, transfer learning from pre-trained NLP models and rich contextual representations of documents, this method allows for a system that learns quickly and continually and generalizes to many new types of documents.

Learning to rank the best suggestions

Our entity extraction technology can identify any number of fields for any type of document. The system learns to auto-complete fields as users type. These suggestions are produced by our OCR model, which processes and reads text in a scanned image. A drop-down list of suggestions allows the user to select the best choice from an array offered by the model. An easy-to-use interface paired with the visual feedback of where the system's suggestions appear ensures that users are able to understand the system’s decisions and correct them if necessary.

For the first document of a given type, the OCR suggestions are unsorted; we have no data to indicate which is likely to be the correct candidate for a given entity. However, the models are improved by continual learning: recommendations are ranked in order of probability, and become more accurate over time through user feedback. With each document the system processes, we gain another example of what a particular entity looks like, and a large number of negative examples—words that were found on the page that were not the correct answers.

This approach requires no hand-crafting of rules to extract fields in a document and remains fault tolerant. In the worst case scenario, the system is still helping the user via autocompletion; in the best case, it's proactively suggesting the right answer before any characters have been typed. As the user gains more trust in the AI, it is possible to transition from an auto-complete to an auto-fill functionality, where the top prediction of the model is automatically selected for a given field.
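To make the ranking idea concrete, here is a minimal sketch; the features and the scikit-learn classifier are illustrative assumptions rather than a description of our production model. Each OCR candidate for a field becomes a small feature vector, the user's past selections supply positive and negative labels, and a scorer orders the drop-down list:

```python
# Sketch: rank OCR candidates for a field with a simple learned scorer.
# Features and classifier are illustrative, not Element AI's production model.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each candidate word from the OCR output becomes a small feature vector,
# e.g. [normalized x, normalized y, looks_like_date, distance_to_keyword].
train_X = np.array([
    [0.8, 0.1, 1.0, 0.05],   # the candidate the user accepted (positive)
    [0.2, 0.5, 0.0, 0.90],   # another word found on the page (negative)
    [0.4, 0.9, 0.0, 0.70],   # another negative example
])
train_y = np.array([1, 0, 0])  # 1 = user accepted this suggestion

ranker = LogisticRegression().fit(train_X, train_y)

# At prediction time, score every candidate and sort the drop-down by score.
candidates = np.array([
    [0.75, 0.12, 1.0, 0.10],
    [0.30, 0.60, 0.0, 0.80],
])
scores = ranker.predict_proba(candidates)[:, 1]
ranking = np.argsort(-scores)  # best suggestion first
print(ranking, scores[ranking])
```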

Continual learning

Unlike solutions that require extensive pre-training on historical data, our approach removes the need to acquire large amounts of data up front and allows models to be trained on the fly. It closes the machine learning loop with data inputs from users, meaning that our models learn continually from a constant stream of user feedback. The system employs online learning to quickly update models whenever a new document is entered by a user.

As a result, our system is:

  • Fast to learn: it only takes a few examples for the system to learn a new type of document.
  • Completely user configurable: because we don’t need to train and deploy models specific to every entity, users can configure the product for use with any document they care about. This makes it far easier for customers to trial the product without expensive upfront investment (for us or them).
  • Resistant to data drift: because we’re constantly training on human feedback, we adapt to changes in document formats or content over time.
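As a rough sketch of the online-learning step described above (the model class and features are assumptions chosen for illustration), each confirmed document can be folded into the model with an incremental update rather than a full retrain:

```python
# Sketch: incremental model updates after each user-confirmed document.
# Features and model class are illustrative assumptions, not our actual system.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # 0 = rejected candidate, 1 = accepted candidate

def on_document_confirmed(features: np.ndarray, labels: np.ndarray) -> None:
    """Fold one document's worth of user feedback into the model immediately."""
    # partial_fit updates the weights in place -- no full retraining required.
    model.partial_fit(features, labels, classes=classes)

# The user confirmed one suggestion (label 1) and implicitly rejected two others.
features = np.array([
    [0.8, 0.1, 1.0],
    [0.2, 0.5, 0.0],
    [0.4, 0.9, 0.0],
])
labels = np.array([1, 0, 0])
on_document_confirmed(features, labels)

# Future candidates can now be ranked with the freshly updated model.
print(model.decision_function(np.array([[0.7, 0.2, 1.0], [0.3, 0.6, 0.0]])))
```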

State-of-the-art OCR

Our system leverages state-of-the-art deep learning models for OCR that we develop in-house. Our OCR models are trained on large amounts of handwritten and printed text from real-world settings. The datasets that we use are open-source or owned by Element AI, and represent a wide range of domains, including receipts, invoices, reports, images taken in the wild and cleverly engineered synthetic data. These models learn from millions of examples such that they can be applied to nearly any domain for automated document processing. Our human-in-the-loop architecture allows the user to correct any transcription mistakes that our model makes. This information is then fed back to our model and used to improve its future performance.

Using pre-trained models

Recent advances in NLP have enabled large pre-trained language models such as BERT, XLNet and GPT-2 to be fine-tuned for state-of-the-art results, an impact comparable to the one deep learning models trained on ImageNet had on computer vision. These models have demonstrated that pre-training on large amounts of data can provide state-of-the-art performance on a variety of downstream tasks with minimal fine-tuning. Our system captures the semantics of the text in a document using the word and character embeddings of these pre-trained NLP models, which helps it extract fields of similar meaning. For instance, our system can learn that text boxes containing the words “Sender” and “From” have similar meaning and represent contextual cues for determining where an entity lies on the page.
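A small sketch of this idea, using a public BERT checkpoint from the Hugging Face transformers library rather than our in-house models: the embeddings of the labels “Sender” and “From” land close together, which is exactly the kind of signal that helps match differently worded fields.

```python
# Sketch: comparing field labels with a pre-trained BERT model.
# Uses a public checkpoint for illustration, not Element AI's in-house models.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer into a single vector for the text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state.mean(dim=1).squeeze(0)

similarity = torch.cosine_similarity(embed("Sender"), embed("From"), dim=0)
print(f"'Sender' vs 'From': {similarity.item():.3f}")
```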

Creating rich document representations

One of the challenges of a system that continually learns and improves is ensuring that models can learn quickly without forgetting how to complete previous tasks, while still generalizing to new, unseen types of documents. To make field extraction more robust to varying templates and domains, we calculate a vector representation of each document that captures rich contextual information, such as commonly appearing text, the location of text on the page and the surrounding fields and information. From these document representations, we can compute clusters of similar documents, which allows us to detect when a more difficult sample arrives. It also minimizes the size of our training data, because we can sample the data that will most benefit the performance of the model.
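As a simplified sketch of this clustering step (TF-IDF vectors and k-means here stand in for the richer document representations described above): documents are embedded, clustered, and a new document's distance to the existing clusters flags whether it looks familiar or needs closer attention.

```python
# Sketch: cluster document vectors and flag unfamiliar documents.
# TF-IDF + k-means are stand-ins for the richer representations described above.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "invoice number total amount due date vendor",
    "invoice total vendor payment terms due date",
    "bill of lading shipper consignee port of loading",
    "bill of lading carrier container number port",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_vectors)

# For a new document, the distance to the nearest cluster centre tells us
# whether it resembles documents the system has already learned from.
new_doc = vectorizer.transform(["purchase order supplier delivery address"])
distance_to_nearest = kmeans.transform(new_doc).min()
print(f"Distance to nearest cluster: {distance_to_nearest:.3f}")
```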

Conclusion

By leveraging the techniques described above, Element AI Document Intelligence is able to rapidly learn how to extract relevant information from structured and semi-structured documents. Our system combines various types of machine learning models with a human-in-the-loop approach to continually learn from user feedback. Over time, our algorithms learn to rank suggestions and auto-complete fields. Crucial to this approach is an interface that is explainable and that allows the models to be helpful even when a rare or unseen document appears in the system.

Once information has been successfully extracted from documents, users can quickly focus on important subsequent tasks using the extracted data. Depending on need, the extracted data can serve as crucial input for future predictive and decision-making algorithms.