
Don’t Be Misled by the OCR Percentage

I keep seeing projects where the success of a data capture system is measured solely by its OCR accuracy rate. Even in proof-of-concept tests, clients still tend to compare solutions by the extraction percentages they achieve.

I suppose this is partly our fault as technology providers, since in the past we focused heavily on this parameter and always tried to improve it as much as possible in our implementations. This year, Kofax published an article on this topic: The Truth About OCR Accuracy, which I want to share with everyone interested in document data capture.

The basic issue is that the capture (or OCR) percentage by itself is not a meaningful business metric. For example, what decision can an executive make if we tell them that one solution has an 80% capture rate and another has 70%? Probably none! How could they understand the impact of either solution on their business, or how would they calculate a possible return on investment? They couldn’t! Most likely, they would request more information to understand the implications of the project for their business.

It would be too simplistic to think that the first solution is better based solely on this figure. What if this first solution has fewer features to facilitate exception handling (data that could not be captured)? Suppose that with the first solution it takes twice as long to resolve each exception. In this scenario, it could actually be faster to handle the 30% of exceptions in the second solution than the 20% in the first. In other words, the second solution could be more effective from a business perspective, offering greater benefits to the client. In fact, the higher the volume of documents to be processed, the greater the benefit compared to the first solution. The article mentioned above also describes techniques that facilitate exception management.
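A quick back-of-the-envelope calculation makes this concrete. The sketch below reuses the capture rates from the example (80% and 70%) together with assumed per-exception handling times; the figures are hypothetical, but they show how the solution with the "worse" OCR percentage can win on total effort:

```python
# Illustrative comparison of total exception-handling effort per 1,000 documents.
# The capture rates come from the example above; the handling times are assumptions.

DOCS = 1_000

# Solution A: 80% capture rate, but each exception takes 2 minutes to resolve.
exceptions_a = DOCS * (1 - 0.80)   # 200 exceptions
effort_a = exceptions_a * 2.0      # 400 minutes of manual work

# Solution B: 70% capture rate, but each exception takes only 1 minute to resolve.
exceptions_b = DOCS * (1 - 0.70)   # 300 exceptions
effort_b = exceptions_b * 1.0      # 300 minutes of manual work

print(f"Solution A: {exceptions_a:.0f} exceptions, {effort_a:.0f} min of manual work")
print(f"Solution B: {exceptions_b:.0f} exceptions, {effort_b:.0f} min of manual work")
# Despite the lower capture rate, Solution B needs 100 fewer minutes per 1,000 documents.
```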

Another factor that can skew our perception is the recognition threshold (the confidence level above which extracted data is accepted as correct). This is a number between 1 and 100 that is set manually; usually, only data with a high confidence score (for example, above 80%) is accepted. If the first solution has lowered this threshold significantly (let's say to 25%), it will pick up some extra hits and its accuracy rate goes up, but you can no longer trust the data it returns, since much of it will be incorrect. For that reason, all of the information would need to be confirmed manually (everything must be validated, because you never know when it will be accurate).

If the second solution has set a higher threshold, it guarantees better quality of the extracted information, but by rejecting more data, its accuracy rate is penalized. The paradox is that both solutions could actually be returning exactly the same data, but the second would appear worse. In both cases, the user must handle the data manually (either by validation or rejection), so once again, the most relevant metric is the time the employee needs to manage exceptions.
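The paradox is easy to see in code. The sketch below runs the same (made-up) extraction results through two different confidence thresholds; the field names, values and confidence scores are invented for illustration, and the point is simply that the reported capture rate changes while the underlying data does not:

```python
# Illustrative sketch: identical extraction results evaluated under two thresholds.
# All fields, values and confidences are hypothetical.

extracted = [
    {"field": "invoice_number", "value": "FA-1021",    "confidence": 97},
    {"field": "total_amount",   "value": "1,250.00",   "confidence": 88},
    {"field": "due_date",       "value": "2O19-03-15", "confidence": 31},  # zero misread as the letter O
    {"field": "tax_id",         "value": "B6487Z123",  "confidence": 27},  # likely misread character
]

def accepted(fields, threshold):
    """Fields the engine would report as 'captured' at the given confidence threshold."""
    return [f for f in fields if f["confidence"] >= threshold]

for threshold in (25, 80):
    hits = accepted(extracted, threshold)
    rate = 100 * len(hits) / len(extracted)
    print(f"threshold {threshold}: reported capture rate {rate:.0f}%")

# threshold 25: reported capture rate 100% -> includes two misreads the user must still check
# threshold 80: reported capture rate 50%  -> fewer "hits", but the accepted data can be trusted
```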

In summary, the OCR percentage is an indicative but insufficient metric. What really matters is the total time required to process a document on average, from start to finish. This time depends not only on the OCR percentage but also on the speed at which exceptions are managed. With this information, decisions can actually be made. For example, if a solution allows me to process documents in 70% less time than today’s manual process, I’d very likely be interested in implementing it. If solution A allows me to process 1,000 documents per day and solution B processes 800 documents per day, I can already calculate which one will offer me a better return on investment.
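As a rough model, the average end-to-end time per document is a weighted sum of the automatic path and the exception path, which translates directly into a throughput (and from there an ROI) estimate. Every figure in the sketch below is an assumption for illustration, not a benchmark:

```python
# Hypothetical comparison based on end-to-end processing time rather than raw OCR rate.
# Capture rates, handling times and the working day are all assumptions.

def avg_seconds_per_document(capture_rate, auto_seconds, exception_seconds):
    """Average end-to-end time per document: automatic path plus manual exception handling."""
    return capture_rate * auto_seconds + (1 - capture_rate) * exception_seconds

# Solution A: higher capture rate, slower exception-handling screens.
a = avg_seconds_per_document(capture_rate=0.80, auto_seconds=5, exception_seconds=120)
# Solution B: lower capture rate, faster validation and correction screens.
b = avg_seconds_per_document(capture_rate=0.70, auto_seconds=5, exception_seconds=70)

workday_seconds = 8 * 3600
print(f"Solution A: {a:.1f} s/doc -> {workday_seconds / a:.0f} docs/day per operator")
print(f"Solution B: {b:.1f} s/doc -> {workday_seconds / b:.0f} docs/day per operator")
```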

If, despite this, someone is only interested in the OCR percentage, my recommendation would be to simply choose an OCR engine (there are even free ones), not a complete capture solution.

Finally, I would like to highlight that in modern implementations of this type of solution, where increasingly complex documents are handled, the OCR percentage is becoming less and less relevant. Traditionally, when working with more structured documents, rule-based systems were used to capture information. We kept adding rules to improve the percentage, but it became increasingly complicated because each new rule interfered with the previous ones, so improvements eventually plateaued.

Today, projects involve more complex documents such as mortgages, deeds, meeting minutes and so on, and are mostly based on machine learning techniques: the system is allowed to learn on its own as it processes documentation, and no rules are implemented. The challenge with this approach is that it needs a lot of samples to learn well. Since there usually aren't that many examples, projects start with lower recognition rates, and the main focus of the implementation is on designing effective forms for data correction and entry. The return on investment isn't as quick, but the cost of manually processing these complex documents is very high, and over time (and more documents) the solution keeps learning and the savings eventually become significant.
