Document Intelligence: Accelerated document processing for any documents in any industry
Archy de Berker
March 25 8 min

The world is becoming increasingly digital, but many business processes remain clogged with paper. Across industries, technologically sophisticated companies struggle to deal with the volume and complexity of paper documents they receive from their partners and customers. A recent survey concluded that 51% of enterprise data resided in paper or unstructured digital documents (source). In financial services we see cheques and scanned mortgage application forms; retailers struggle to reconcile invoices and fulfill purchase orders; and logistics companies are inundated by the bills of lading and manifests that accompany shipments around the globe.

Documents are information silos. You cannot search, visualize or analyze data that is locked in paper, scans or faxes. Currently, digitizing your documents in a structured way (extracting the information in them into a CSV or a database) is an expensive and often extremely time-consuming process. McKinsey estimates that a typical global financial institution spends 0.5% of revenue on handling documents (source), while retailers spend an average of around $12 for every invoice they reconcile (source). From conversations with insurers, we estimate that 30% of submissions go unwritten because they are not processed in time: slow document processing loses revenue as well as adding cost. Little wonder businesses have turned to automation to solve this ever-present challenge.

The challenge of data extraction

Data extraction from documents is a hard problem. First, the way that people express themselves varies hugely across, and even within, organizations. There is enormous variety in word choice and formatting: the same date can be written ten different ways; different organizations develop their own distinct internal vocabularies over time. Second, the way that information is organized on the page is highly variable. One of your suppliers might put their address in the top left and yours in the top right-hand corner, and another might do the opposite. Third, both of these factors vary unpredictably over time—people change the format and layout of documents without warning.

Combined, these three problems mean that rule-based solutions, which rely on a predetermined template to extract information from a field within a specific document, inevitably break down. Real-world documents are too variable and they change too often for template-based approaches to work well at scale. Many of our clients have experienced the pain of a new supplier’s irregular template or a slight change of format that renders their existing template-based solution useless.
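To make the fragility concrete, here is a minimal sketch of a template-based extractor. The template, the function name and the two sample invoices are hypothetical and only illustrate the failure mode described above; they are not taken from any real product.

```python
import re

# Hypothetical template: assumes the invoice total always appears
# on its own line as "Total: <amount>".
TOTAL_TEMPLATE = re.compile(r"^Total:\s*\$?([\d,]+\.\d{2})$", re.MULTILINE)


def extract_total(text):
    """Return the invoice total as a float, or None when the template does not match."""
    match = TOTAL_TEMPLATE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Two made-up suppliers expressing the same information differently.
supplier_a = "Invoice 001\nTotal: $1,250.00\n"
supplier_b = "Invoice 002\nAmount due (USD) 1,250.00\n"

total_a = extract_total(supplier_a)  # the template matches
total_b = extract_total(supplier_b)  # different wording, so the template finds nothing
```

The second invoice carries the same amount, but a small change of wording is enough for the rule to return nothing.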

The last few years have seen the rise of more sophisticated machine-learning-based approaches, based upon more flexible models that can navigate the complexity of real-world data. While these have seen some success when providers have focused upon one very specific form of document (such as a UK invoice), it can be difficult to find enough data to train these models to perform well. In situations where organizations have large amounts of cleanly labelled data, it’s possible to produce high-performing models. But enterprise data is rarely copious or clean. In fact, when we’ve explored such approaches we’ve seen persistent issues of data drift: the historical data that we train our models on ends up looking different from the data that customers see in daily use. This produces unpredictable and undesirable effects, ultimately requiring back-and-forth with customers to retrain models and resolve issues.

Data drift: the phenomenon whereby data changes over time. This can be problematic for machine learning models that have been trained on old data and then tested against new data: the patterns they learned in the old data might no longer be relevant. For instance, a news classification service trained on data from the 1990s might classify articles mentioning Donald Trump as belonging to “business”, whereas today they might fit better into the “politics” category.
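One crude way to see drift in text data is to measure how many words in newly arriving documents never appeared in the training data. The sketch below is an illustrative proxy only, with made-up sample strings and a hypothetical function name; it is not a description of how drift is measured in Document Intelligence.

```python
def out_of_vocabulary_rate(training_docs, new_docs):
    """Fraction of tokens in new_docs that never occur in training_docs."""
    vocabulary = {tok for doc in training_docs for tok in doc.lower().split()}
    new_tokens = [tok for doc in new_docs for tok in doc.lower().split()]
    if not new_tokens:
        return 0.0
    unseen = sum(1 for tok in new_tokens if tok not in vocabulary)
    return unseen / len(new_tokens)


# Made-up historical data and two batches of "live" documents.
historical = ["invoice total due", "invoice date total"]
stable_batch = ["invoice total"]
drifted_batch = ["amount payable invoice"]

stable_rate = out_of_vocabulary_rate(historical, stable_batch)
drifted_rate = out_of_vocabulary_rate(historical, drifted_batch)
```

A rising rate suggests that the patterns a model learned on historical data may no longer describe what it sees in daily use.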

Continual learning

The solution we’ve adopted in our Document Intelligence product to combat data drift is called continual learning. This means that instead of accumulating historical data, training models and then deploying, we deploy a system that learns in situ. This has several benefits for our clients:

  • You don’t need any historical data to work with the system, because it will learn on the data you’re actually seeing from day to day. This eliminates the problem of data drift.
  • We can deploy the system very rapidly—in as little as two weeks—because we don’t have to train models in advance. This makes it easy to test the system as a user and confirm that it’s valuable without committing to a lengthy engagement.
  • There are no minimum document volumes. With more data the system will perform better, but it still functions well on the long-tail of rare documents that other products do not support.
  • The system can work with any document you see in your organization, and you can configure it for whatever data extraction tasks arise. Because we don’t need to pre-train models, we can make the system entirely user configurable.
  • Over time, as accuracy levels increase, you can progressively transition to automation, offering even more time savings.
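The idea of learning in situ can be sketched with a toy model that starts with no historical data and updates itself every time a user confirms a field. The class, its counting scheme and the field names are assumptions made for illustration; the models inside Document Intelligence are not described in this article.

```python
from collections import Counter, defaultdict


class OnlineFieldLabeler:
    """Toy continual learner: counts which field label co-occurs with which nearby word."""

    def __init__(self):
        # Maps a lower-cased context word to a Counter of field labels seen with it.
        self.counts = defaultdict(Counter)

    def predict(self, context_words):
        """Return the most-voted field label, or None if nothing has been learned yet."""
        votes = Counter()
        for word in context_words:
            votes.update(self.counts.get(word.lower(), {}))
        return votes.most_common(1)[0][0] if votes else None

    def learn(self, context_words, field):
        """Update the counts from one user-confirmed example."""
        for word in context_words:
            self.counts[word.lower()][field] += 1


labeler = OnlineFieldLabeler()
first_guess = labeler.predict(["Invoice", "date"])  # no data yet, so no suggestion

# Each confirmed extraction is an immediate training example.
labeler.learn(["Invoice", "date"], "invoice_date")
labeler.learn(["Total", "due"], "total_amount")

second_guess = labeler.predict(["Date", "of", "invoice"])
```

There is no separate training phase: the system is usable from the first document and its suggestions track whatever documents actually arrive.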

A novel interface for data extraction

Once we recognized the potential for continual learning to radically improve and accelerate the extraction of data from documents, we realized that we also had to rethink the interfaces that would enable this transformation.

The KPI for most of our clients is time: they want to process documents faster, allowing them to focus on higher-value tasks and more swiftly take actions such as paying invoices, underwriting a client or completing a bank transfer. After extensive user testing, we found that an autocomplete interface was the fastest way to provide suggestions from the model to help in the process of data extraction. In fact, when faced with a completely new document type that the system has never seen before, the autocomplete system alone speeds up data extraction by ~30%¹. As the models behind autocomplete learn to recognize the different fields you care about in a particular class of document, these time savings continue to grow.

Autocomplete for data extraction: combining a state-of-the-art OCR (Optical Character Recognition) system with machine-learning-powered autocomplete allows us to accelerate data extraction from Day 0, while continually improving the speed of extraction as more documents are processed.
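A minimal sketch of the autocomplete idea follows, assuming the OCR step has already produced a list of tokens for the page and that a model supplies an optional relevance score per token. The function name, the token list and the scores are hypothetical.

```python
def suggest(prefix, ocr_tokens, scores=None, limit=3):
    """Suggest OCR tokens starting with the typed prefix, highest model score first."""
    scores = scores or {}
    prefix_lower = prefix.lower()
    matches = {tok for tok in ocr_tokens if tok.lower().startswith(prefix_lower)}
    # Sort by descending score, then alphabetically so ties are deterministic.
    return sorted(matches, key=lambda tok: (-scores.get(tok, 0.0), tok))[:limit]


# Made-up tokens read from a scanned invoice.
page_tokens = ["ACME", "Ltd", "Account", "2021-03-25", "Amount", "1250.00"]

# Day 0: no model scores, so matching on the page text alone still saves typing.
untrained = suggest("a", page_tokens)

# Later: a (hypothetical) model has learned that "Amount" is the likely answer.
trained = suggest("a", page_tokens, scores={"Amount": 0.9})
```

Even without any learned scores the user only has to type a prefix; as scores improve, the right answer moves to the top of the list.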

Complete user configurability

The combination of continual learning and an autocomplete system allows us to provide a generic product that tailors itself to the needs of its users. One of the major benefits of this is that you can choose to configure the system to extract whatever fields matter to your organization.

We provide a simple setup in which you choose the fields you would like to extract and define what kind of data they represent: numbers, words or currency. As the system learns, we provide analytics to allow you to monitor the time savings the system is delivering and the accuracy of the predictions provided via the autocomplete interface. You’re effortlessly training machine learning models for your business, whilst speeding up the data entry you’d be doing anyway.
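As an illustration of what such a setup might look like, the sketch below declares a few fields and the kind of data each holds. The `FieldSpec` class, the parser table and the field names are assumptions for the example, not the product's actual configuration format.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical parsers for the three kinds of data mentioned above.
PARSERS = {
    "number": lambda raw: int(raw.replace(",", "")),
    "word": lambda raw: raw.strip(),
    "currency": lambda raw: Decimal(raw.replace("$", "").replace(",", "").strip()),
}


@dataclass(frozen=True)
class FieldSpec:
    """One user-configured field: its name and the kind of data it holds."""
    name: str
    kind: str

    def parse(self, raw):
        return PARSERS[self.kind](raw)


# A made-up configuration for an invoice task.
invoice_fields = [
    FieldSpec("supplier", "word"),
    FieldSpec("quantity", "number"),
    FieldSpec("total", "currency"),
]

raw_values = {"supplier": " ACME Ltd ", "quantity": "1,200", "total": "$1,250.00"}
parsed = {spec.name: spec.parse(raw_values[spec.name]) for spec in invoice_fields}
```

Declaring the kind of each field up front lets extracted text be checked and normalized before it reaches a database.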

Once accuracy is high, you can switch the system into autofill mode, gradually transitioning from user-enabled document acceleration towards full automation. We know it’s important to handle this transition gradually: many of our clients have been stung by solutions which promised high degrees of automation but delivered huge piles of time-consuming exceptions.

User-configurable tasks, analytics and automation levels make Document Intelligence flexible enough for any extraction task in any industry.
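One way to picture the gradual move from autocomplete to autofill is a gate that tracks how often recent suggestions were accepted and only allows autofill once that rate is high over a full window. The class name, threshold and window size below are illustrative assumptions, not the product's actual settings.

```python
from collections import deque


class AutomationGate:
    """Tracks recent suggestion outcomes and decides which mode is appropriate."""

    def __init__(self, threshold=0.95, window=20):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True = the user accepted the suggestion

    def record(self, suggestion_accepted):
        self.outcomes.append(suggestion_accepted)

    @property
    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    @property
    def mode(self):
        # Require a full window so a few lucky early suggestions do not trigger autofill.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and self.accuracy >= self.threshold:
            return "autofill"
        return "autocomplete"


gate = AutomationGate(threshold=0.9, window=10)
for _ in range(5):
    gate.record(True)
early_mode = gate.mode  # window not yet full

for _ in range(5):
    gate.record(True)
confident_mode = gate.mode  # ten accepted suggestions in a row

gate.record(False)
gate.record(False)
fallback_mode = gate.mode  # accuracy over the last ten drops to 0.8
```

Because the gate falls back to autocomplete when accuracy dips, a change in document format produces reviewed suggestions and not a pile of silent errors.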

From data extraction to business value

Most organizations that receive scanned documents don’t simply want to extract the information from them: they want to do something afterwards. Fast, accurate data extraction is only the first step in the chain.

Document Intelligence is designed as a module of the EAI OS, which allows us, or our partners, to combine it with other components for maximum impact. In many cases data extracted from documents requires some validation or reconciliation: verifying the customer account on a purchase order, reconciling an invoice to a warehouse check, or comparing a loan application to an applicant’s credit history. For instance, our platform for Insurance combines a Document Intelligence module with insurance-specific validation, prioritization, and decision-making capabilities. We’ll be working with some of our key partners to assemble similar toolkits for other industries.
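A downstream reconciliation step can be sketched as a comparison between an extracted record and a reference record. The field names, the tolerance and the sample data below are hypothetical; the sketch only illustrates the kind of check described above, not an EAI OS component.

```python
from decimal import Decimal


def reconcile(invoice, purchase_order, tolerance=Decimal("0.01")):
    """Return a list of discrepancies between an extracted invoice and its purchase order."""
    issues = []
    if invoice["customer_account"] != purchase_order["customer_account"]:
        issues.append("customer_account mismatch")
    if abs(invoice["total"] - purchase_order["total"]) > tolerance:
        issues.append("total mismatch")
    return issues


# Made-up extracted invoice and two candidate purchase orders.
extracted_invoice = {"customer_account": "C-1001", "total": Decimal("1250.00")}
matching_order = {"customer_account": "C-1001", "total": Decimal("1250.00")}
conflicting_order = {"customer_account": "C-1001", "total": Decimal("1200.00")}

clean_result = reconcile(extracted_invoice, matching_order)
flagged_result = reconcile(extracted_invoice, conflicting_order)
```

An empty list means the document can proceed automatically, while any listed issue can be routed to a person for review.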

Conclusion

Element AI’s Document Intelligence system is an industry-agnostic solution that allows you to extract data from the range of documents that are handled in your organization. The use of continual learning and a novel autocomplete interface allows us to provide a blazingly fast data-extraction solution that instantly learns from user feedback, and offers businesses looking to automate data extraction for evolving documents a flexible and reliable means to do so.

1 Results derived from internal user testing