Capturing Unstructured Documents – Blog de Fermin Fernandez

In the world of document capture, documents have traditionally been divided into three types: structured, semi-structured, and unstructured.

More than 20 years ago, the first capture programs focused on extracting information from structured documents, where the position of each data field was known. At that time, when web pages were not yet widespread, it was common to use this type of form to communicate with companies, and thousands of capture processes of this kind were automated (orders, surveys, work reports, exams, etc.).

Later, the automation of semi-structured document capture began, where data varies in position but rules can be created to locate them. The typical project was the capture of supplier invoices, where around 70% of the data could be captured automatically. Subsequently, the technique of learning supplier templates was introduced, now advertised as “Machine Learning.” This technique allows the software to learn the layout of each supplier’s invoice after the user indicates where the data is. Once the main suppliers are learned, and by combining learning with the use of rules, it became possible to capture more than 90% of the invoice data. Over the last 10 years, we have implemented hundreds of such projects, and today we continue to help companies that still haven’t automated their accounts payable processes.

It wasn’t until a few years ago that work began on capturing unstructured documents, where there are no rules to apply and template learning techniques cannot be used because documents of the same type (templates) are not usually repeated. This is the case for capturing data from mortgages, deeds, notarial acts, or emails received in a mailbox.

To extract information from these documents, artificial intelligence techniques must be used, and more specifically, Natural Language Processing (NLP). There are many services and programs that use NLP (Microsoft, Google, etc.), but most of them focus on extracting tags or attributes (for example, getting all the names that appear in a document) and sentiment analysis (whether the text expresses something positive or negative). Moreover, their knowledge bases are created with natural language (the language we usually speak), which is not typically the language used in business documents. That’s why it is important to use a tool capable of using NLP techniques to find specific data in a document (for example, if there are several names in a document, it should indicate who is the Notary and who is the Appearing Party). More generalist tools are usually not suitable for this purpose.

To understand how this type of solution works, here is an example of how data capture from Deeds is performed. The process is divided into three parts:

Learning: This is the key phase, in which a significant (or sufficiently large) sample of documents must be selected, and the system must be taught where the required data is located. The AI engine analyzes the texts (NLP technique), both those containing the data to be found and their surrounding context, to find patterns and assign them weights.

Training is done before going into production, although it can always be adjusted later with new samples. As a result, the software will create a knowledge base containing all the pattern analysis.

In this video, I show how the learning phase is carried out using deed samples as an example:

Recognition: In the production phase, the first step is document recognition. At this point, the knowledge base is applied to each document to find patterns that may match each piece of data. In addition to the data itself, the software returns a confidence percentage.

The recognition phase is usually executed autonomously and as an invisible service to users. However, in this video, I show how it would be done manually and how fast it is:

Validation: After recognition, validation is performed on those documents where there is an issue or a field could not be found.

It’s important to remember that the ultimate goal of all these capture solutions is to reduce the total process time. The objective should never be to achieve a certain recognition percentage. In other blog posts, I have analyzed this topic in more detail, since there are still people who focus solely on capture rates.

In this video, I show the results of the previous recognition applied to deeds. In a production solution, the software would only stop at data points with an issue to be resolved by the user, but in this case, I stopped at each data point so you can see the extraction results:

More and more projects are emerging for capturing unstructured documents, and I believe this will be the trend in the coming years. I hope to continue publishing more successful projects showcasing other types of unstructured documents.

Leave a Reply Cancel reply