Decipher IDP best practices

This is a high-level guide to the implementation of Decipher IDP. The intended audience is users who have Decipher IDP installed and running and are confident with the functionality. This guide is suitable for users seeking to implement a proof of concept or to progress to a production implementation.

Blue Prism recommends a phased approach to implementing a use case with Decipher IDP. By making full use of the product’s capabilities, we can ensure high accuracy throughout the use case life cycle. Each phase consists of multiple iterations (batches processed) and is characterized by the total number of documents processed per document type, the expected level of manual verification required, and the expected machine precision and recall.

In general, early phases will have lower machine accuracy and higher manual effort, while later phases will have higher machine accuracy and lower manual effort. We realize the business benefits of Decipher IDP by reaching the later phases of this use case life cycle.

Phase 1: Setup and configuration

Phase characteristics

  • Total number of documents processed per document type: 0-50
  • Number of documents per batch: 1 sample per layout
  • Level of manual verification required: High
  • Expected machine precision: Medium
  • Expected machine recall: Low
  • Type of documents processed: Sample/Test
  • Recommended QA setting (percentage sampled for secondary verification): High

Key activities

  • Ensure Decipher IDP is installed and interacting successfully with Blue Prism. Upstream and downstream activities, such as preparing batches for import into Decipher IDP or processing data in Blue Prism after it has been extracted by Decipher IDP, are outside the scope of this document.
  • Ensure that you have access to a sufficient quantity of documents for training purposes, that the image format is supported by Decipher IDP, and that the image quality is good enough. In general, the higher the image quality, the better Decipher IDP will perform. 300 dpi is the absolute minimum, but 600 dpi or higher is preferred.
  • Documents should adhere as closely as possible to the real-world conditions of the use case. Five to six sample documents are enough for the initial configuration and to begin training. More documents are generally better, and for Decipher IDP to perform at scale with minimal human intervention, we can expect this training set to grow as the machine learning model is trained.
  • Business requirements to gather: processing volume and velocity, data to be extracted, required accuracy, and any other relevant SLAs.
  • At this point, there is often an inclination to start at the most granular level, the document form definition (DFD), and work our way up. Resist the urge to jump directly into the DFDs. Assuming the perspective of the business, start at the higher level and work down. First, determine how the documents are coming to us, then determine what needs to happen to the documents. These answers will inform how we will administer the Decipher IDP use case, namely what batch types, document types, and DFDs we will need to configure.
  • Document configuration planning: Based on the business requirements of the use case, decide how many batch types, document types, and DFDs are needed. For less experienced teams, select simple use cases, where the ratio of batch type to document type to DFD is 1:1:1.
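
The planning step above can be sketched as a small data model. This is an illustration only: the class and field names (Dfd, DocumentType, BatchType) are hypothetical and are not part of any Decipher IDP API. The actual configuration is done in the Decipher IDP admin panel.

```python
from dataclasses import dataclass, field


@dataclass
class Dfd:
    """Hypothetical stand-in for a document form definition."""
    name: str
    fields: list[str]


@dataclass
class DocumentType:
    """Couples one DFD with (later) an ML model."""
    name: str
    dfd: Dfd
    ml_enabled: bool = False  # machine learning is off by default


@dataclass
class BatchType:
    """Groups document types that share batch-level settings."""
    name: str
    language: str
    document_types: list[DocumentType] = field(default_factory=list)

    def needs_classification(self) -> bool:
        # Mixed batches (more than one document type) need a classification step.
        return len(self.document_types) > 1


# A simple 1:1:1 plan: one batch type, one document type, one DFD.
invoice_dfd = Dfd("Invoice DFD", ["invoice_number", "invoice_date", "total"])
invoice_type = DocumentType("Invoice", invoice_dfd)
invoice_batches = BatchType("Invoices (EN)", "en", [invoice_type])
```

Writing the plan down in this form before touching the admin panel makes it easy to see whether a batch type mixes document types and therefore needs classification.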

Batch Types

  • Look upstream: How are documents coming in? In mixed batches or uniform ones? Mixed batches require a classification step, but uniform batches can skip classification.
  • Mixed batches: It’s OK to have multiple document types in the same batch type, as Decipher IDP can classify accordingly. Limitation: all document types in the same batch type share language, locale, priority level, and processing time SLA settings. If any of these requirements differ, we recommend splitting up batches at the classify stage.
  • Quality control settings are also configured at the batch level. Verification is a repetitive activity, and human annotators can lose concentration and become more prone to errors after a certain number of documents. Decipher IDP provides the option to have more than one operator review a set percentage of batches. We recommend increasing the percentage for batches with less mature machine learning (ML) models or with critical accuracy requirements. You can also increase the percentage by user or user group for less experienced operators.
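
To get a feel for what a given QA percentage means in practice, the following minimal sketch selects a random share of batches for secondary review. It is an assumption-laden illustration: Decipher IDP performs its own sampling internally, and the function name and parameters here are hypothetical.

```python
import random


def select_for_secondary_review(batch_ids, qa_percentage, seed=None):
    """Return a random subset of batch IDs sized to qa_percentage (0-100)."""
    if not 0 <= qa_percentage <= 100:
        raise ValueError("qa_percentage must be between 0 and 100")
    rng = random.Random(seed)  # a seed makes the selection repeatable
    batch_ids = list(batch_ids)
    sample_size = round(len(batch_ids) * qa_percentage / 100)
    return sorted(rng.sample(batch_ids, sample_size))
```

For example, with 20 batches and a 25% setting, five batches would be reviewed by a second operator.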

Document Types

Think of the Document Type as the coupling mechanism for DFD and the ML model. Most of the time, a use case will correspond to a document type.

The machine learning default setting is off. It’s OK to leave it off for now. As we will explain later, it’s important to get the DFD in a stable state before worrying about the machine learning.

Document Form Definitions

  • Refer to the business requirements we just gathered. What is the common set of data to be extracted from each document, and is it presented in a consistent manner? If yes, create a DFD. If no, consider creating multiple DFDs. Having a common set of data to be extracted is more important than the format. Consider the tradeoffs of one or multiple DFDs, but don’t be too prescriptive. Decipher IDP’s neural network is flexible.
  • Document configuration execution:
    • Requirements gathering is top down, but actual document configuration within the Decipher IDP admin panel is bottom up. Start by creating the DFD, then the document type(s), and finally the batch type(s).
    • Review the field options for DFD files. Options can be useful to improve accuracy and enforce validation rules, especially if documents in the use case are more uniform in appearance.
    • Formatting and field options can help improve accuracy up to a point, but they can also make the model too rigid. Exercise restraint if you expect more variation in document appearance for a use case. Overfitting the ML model is a concern to keep in mind when configuring a DFD. For fields that are being extracted inaccurately, beginning with the field defined as text type helps Decipher IDP recognize it more efficiently. The field can be changed to a more appropriate data type once recall improves.
  • DFD Iteration:
    • Test a few small batches to determine whether any changes to the DFD are needed. DFDs must be stable before we invest the time to train on thousands of documents. Minor changes, such as adding sample headers or adjusting formulae, are OK, but fundamental changes to the DFD will reset the ML model, causing prior training to be lost.
    • Refer to the Advanced Configuration module of the Decipher IDP Foundation Training course to understand how to train, measure and test the Decipher IDP model.

Phase 2: Annotation

Phase characteristics

  • Total number of documents processed per document type: 50+, up to 1000 depending on current accuracy performance
  • Number of documents per batch: 2-5 samples per layout (assuming random sampling)
  • Level of manual verification required: High
  • Expected machine precision: Medium
  • Expected machine recall: Medium
  • Type of documents processed: Sample/Test
  • Recommended QA setting (percentage sampled for secondary verification): Medium

Key activities

Now that the DFD is stable, we can turn on machine learning and begin to train the ML model. If you have not already done so, create a new ML model and attach it to the document type. With few documents observed so far, our focus is on precision. Recall will improve over time as more documents are processed and Decipher IDP learns from prior annotations. False negatives can always be addressed by manual annotation, but false positives are riskier and more problematic, and should be addressed.
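
As a reminder of what the two measures mean, the sketch below uses the standard definitions of precision and recall. Here a true positive is a field captured correctly, a false positive is a field captured with a wrong value, and a false negative is a field the machine missed. The function names are illustrative, and how Decipher IDP calculates its own metrics is not covered by this guide.

```python
def precision(true_positives, false_positives):
    """Share of machine-captured values that are correct."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0


def recall(true_positives, false_negatives):
    """Share of expected values that the machine captured correctly."""
    expected = true_positives + false_negatives
    return true_positives / expected if expected else 0.0


# Worked example: 100 field values expected, 60 captured, 57 of them correct.
tp, fp, fn = 57, 3, 43
print(f"precision={precision(tp, fp):.2f} recall={recall(tp, fn):.2f}")
```

In this example precision is already high while recall is still modest, which is the pattern to expect early in this phase.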

Continue training on test/sample documents until you reach a suitable level of accuracy. A 1,000-document threshold is generally the upper limit for this phase. Variety of document sampling is important, and re-training on documents that have already been processed should be avoided. The closer to real-world conditions, the better. Doing so accelerates training and avoids overfitting the model.

Phase 3: Machine learning training

Phase characteristics

  • Total number of documents processed per document type: 1000-5000
  • Number of documents per batch: Completely random sampling with a maximum batch size of 25-50 pages (pay careful attention to multi-page documents)
  • Level of manual verification required: Medium
  • Expected machine precision: High
  • Expected machine recall: Medium
  • Type of documents processed: Sample/test or live documents
  • Recommended QA setting (percentage sampled for secondary verification): Low
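
The batch sizing guidance above can be sketched as a small upstream helper that shuffles documents and packs them into batches under a page cap without splitting a multi-page document. This is a hypothetical illustration of batch preparation done outside Decipher IDP; the function name and the (document_id, page_count) input shape are assumptions.

```python
import random


def build_batches(documents, max_pages=50, seed=None):
    """Shuffle documents, then pack them into batches without splitting any document.

    documents: list of (document_id, page_count) tuples.
    A single document longer than max_pages ends up alone in its own batch.
    """
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)

    batches, current, current_pages = [], [], 0
    for doc_id, pages in shuffled:
        # Start a new batch if this document would push the current one over the cap.
        if current and current_pages + pages > max_pages:
            batches.append(current)
            current, current_pages = [], 0
        current.append(doc_id)
        current_pages += pages
    if current:
        batches.append(current)
    return batches
```

Counting pages rather than documents is the point of the sketch: a batch of ten documents can vary widely in size once multi-page documents are involved.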

Key activities

After the initial 1,000-document threshold is reached, Decipher IDP’s neural network is triggered. Ideally, precision has reached a high level by now, and we can shift our focus to improving recall. At this point, you should be comfortable introducing live documents. You have the option to retrain the model after a set number of documents. The default setting is 1,000, and the range is 50-5,000. We recommend adjusting the setting based on current performance. Early iterations may require retraining after 50 documents, but once a model has enough training data, retraining can occur less frequently to conserve resources and accelerate processing.
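
One way to reason about the retraining setting is shown below. The default and range come from this guide, but the 10% rule itself is purely an assumption for illustration and not a Blue Prism recommendation; the function name is hypothetical. The sketch simply shows an interval that starts small and grows with the training set, clamped to the permitted range.

```python
MIN_INTERVAL = 50        # lower bound of the setting, per this guide
MAX_INTERVAL = 5000      # upper bound of the setting, per this guide
DEFAULT_INTERVAL = 1000  # default value of the setting, per this guide


def suggest_retrain_interval(documents_trained):
    """Hypothetical heuristic: retrain after roughly 10% of the training set size."""
    if documents_trained < 0:
        raise ValueError("documents_trained must not be negative")
    suggestion = documents_trained // 10
    # Clamp the suggestion to the range the setting accepts.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, suggestion))
```

Under this assumed rule, a model trained on 200 documents would retrain every 50 documents, while one trained on 10,000 documents would use the default of 1,000. Actual values should be chosen from observed precision and recall.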

Phase 4: Processing at scale

Phase characteristics

  • Total number of documents processed per document type: 5000+
  • Level of manual verification required: Low
  • Expected machine precision: High
  • Expected machine recall: High
  • Type of documents processed: Live documents
  • Recommended QA setting (percentage sampled for secondary verification): Low

Key activities

In this phase, the ML model is mature and Decipher IDP accuracy has been optimized. Decipher IDP will continue to learn from the most recent 5,000 documents observed. As less manual verification is required, consider scaling up batch sizes and throughput as allowed by the size of the deployment. Setting a benchmark accuracy required to skip class verification will also save time. If accuracy remains at an acceptable level, lower or remove the percentage marked for QA, but keep the quality control feature enabled for critical batch types. Monitor exceptions for trends.
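
The document-count ranges for the four phases can be summarized in a short lookup. This is a rough sketch: the function name is hypothetical, and because the ranges in this guide share their boundary values, it assumes that a boundary count (50, 1,000, or 5,000) belongs to the later phase. Document count alone is only a guide, since moving on also depends on current accuracy.

```python
def phase_for(documents_processed):
    """Map documents processed per document type to a (number, name) phase tuple."""
    if documents_processed < 0:
        raise ValueError("documents_processed must not be negative")
    if documents_processed < 50:
        return 1, "Setup and configuration"
    if documents_processed < 1000:
        return 2, "Annotation"
    if documents_processed < 5000:
        return 3, "Machine learning training"
    return 4, "Processing at scale"


for count in (10, 400, 2500, 12000):
    number, name = phase_for(count)
    print(f"{count} documents: Phase {number} ({name})")
```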