Training models overview

SS&C | Blue Prism® Decipher IDP uses training models to learn, and gradually improve, document processing. This topic provides an overview of the different types of training available in Decipher IDP:

Rules-based training (default)

Each document uploaded to Decipher IDP trains the rules-based model and, once verified, feeds into an overall training pool. This model is often accurate enough for structured documents without enabling the additional machine learning functionality. The training is based on the document layout, and is not directly connected to the document form definition (DFD) or document type. Training from similar document layouts is combined where Decipher IDP has matched at least 60% of the text fields. This percentage can be modified using the TemplateMinMatchPercent miscellaneous parameter. For more details, see Miscellaneous parameters.
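As a rough illustration of the matching threshold, the following sketch computes the share of text fields that two layouts have in common and compares it with a minimum percentage. Only the 60% default and the TemplateMinMatchPercent parameter name come from this topic. The function names, the field labels, and the set-based comparison are assumptions made for illustration, and this is not Decipher IDP's actual matching logic.

```python
# Illustrative only: not Decipher IDP's real matching algorithm.
TEMPLATE_MIN_MATCH_PERCENT = 60  # default described in this topic


def match_percent(known_fields, candidate_fields):
    """Return the percentage of known text fields also found in the candidate."""
    if not known_fields:
        return 0.0
    matched = set(known_fields) & set(candidate_fields)
    return 100.0 * len(matched) / len(known_fields)


def layouts_combine(known_fields, candidate_fields,
                    threshold=TEMPLATE_MIN_MATCH_PERCENT):
    """Hypothetical check: layouts count as similar when the match meets the threshold."""
    return match_percent(known_fields, candidate_fields) >= threshold


# Hypothetical text fields found on a trained layout and on a new document.
known = {"Invoice No", "Date", "Total", "VAT", "Supplier"}
candidate = {"Invoice No", "Date", "Total", "PO Number"}

print(match_percent(known, candidate))    # 3 of the 5 known fields match
print(layouts_combine(known, candidate))  # compared against the 60% default
```

In this sketch, raising the threshold makes layouts less likely to share training, and lowering it makes sharing more likely.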

In addition, training data can be imported, exported, and deleted. When imported data contains duplicate information, Decipher IDP merges it intelligently. For more details, see Training data.

The rules-based training captures data from the following elements, referred to as hints, defined in the DFD:

  • Keywords

  • Data types

  • Lists

  • Regex

  • Formula

  • Location (after training)

With the exception of location data, this information is available in the DFD and is used in combination with the training data (where it exists). Location data is captured after the first document has been trained.

Document classification

Document classification is trained by uploading a group of documents for a document type, which informs how Decipher IDP separates batches of documents when required. Classification is only required when more than one document type is selected in your batch type. It ensures that the correct document form definition (DFD) is assigned and the requested data is extracted.

For more details, see Training classification models.

Structured machine learning

Enabling this model in a document type adds an extra layer of document-specific learning for structured documents, supplementing the rules-based training to improve data capture. The scale of the improvement depends on how successful the original rules-based training was, and can be monitored using Data Capture Accuracy reports. For more details, see Reports – Accuracy.

The default training size is 1,000 documents, but this can be set to any value between 50 and 5,000. The model can be set to train automatically, or periodically after a configured number of documents. For more details, see Machine learning.
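The limits above can be expressed as a simple range check. This is a minimal sketch: the numeric values come from this topic, while the constant and function names are hypothetical and do not correspond to any Decipher IDP setting or API.

```python
# Values taken from this topic; the names are hypothetical.
DEFAULT_TRAINING_SIZE = 1000
MIN_TRAINING_SIZE = 50
MAX_TRAINING_SIZE = 5000


def resolve_training_size(requested=None):
    """Return the training size to use, or raise if it is out of range."""
    if requested is None:
        return DEFAULT_TRAINING_SIZE
    if not MIN_TRAINING_SIZE <= requested <= MAX_TRAINING_SIZE:
        raise ValueError(
            f"Training size must be between {MIN_TRAINING_SIZE} "
            f"and {MAX_TRAINING_SIZE}, got {requested}"
        )
    return requested


print(resolve_training_size())     # no value supplied, so the default is used
print(resolve_training_size(250))  # within range, so accepted as-is
```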

Changing the ID of a field causes the training associated with that field to be lost, because the model identifies fields by their IDs. If new fields are added, the model initially has knowledge of only the existing fields, but will gradually learn the new ones.
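The effect of changing a field ID can be pictured as a lookup keyed by ID. The sketch below is an analogy only: the dictionary, the field IDs, and the helper function are hypothetical, and the way Decipher IDP stores training internally is not described in this topic.

```python
# Hypothetical store: learned samples keyed by field ID.
training_by_field_id = {
    "invoice_total": ["sample 1", "sample 2"],
    "invoice_date": ["sample 1"],
}


def training_for(field_id, store):
    """Return the training held for a field ID, or an empty list if none exists."""
    return store.get(field_id, [])


# The field keeps its training while its ID is unchanged.
print(training_for("invoice_total", training_by_field_id))

# After the ID is changed to "total_amount", nothing is found under the new ID:
# the earlier training is still keyed by the old ID and is effectively lost.
print(training_for("total_amount", training_by_field_id))
```

A newly added field behaves the same way as the renamed one: it starts with no training and builds it up over time.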

Unstructured machine learning (natural language processing)

The unstructured training model can be used where the rules-based method would have difficulty gathering the required data from a document containing free-flowing text, such as contracts or agreements. For more details, see Natural Language Processing (NLP) plugin.

The default training size is 300 documents, and this training is not extensible: each time training is carried out, a new model is created and the previous training is discarded. As such, it's recommended that the model is created using a wide range of formats and layouts as a one-off training exercise. This will maintain the integrity of the training data and reduce the chance of unexpected outcomes.
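The difference between extending a model and replacing it can be sketched as follows. The class and document names are hypothetical and only illustrate the behavior described above, where each training run discards what was learned before. This is not how the NLP plugin is implemented.

```python
class ReplaceOnTrainModel:
    """Hypothetical model whose training is replaced, not extended."""

    def __init__(self):
        self.trained_on = []

    def train(self, documents):
        # Each run builds a new model from the supplied documents only;
        # whatever was learned in earlier runs is discarded.
        self.trained_on = list(documents)


model = ReplaceOnTrainModel()
model.train(["contract_a", "contract_b"])
model.train(["agreement_c"])

# Only the documents from the most recent run remain.
print(model.trained_on)
```

This is why a single training exercise covering a wide range of formats and layouts is preferable to several small ones.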