Decipher best practices


This is a high-level guide to the implementation of Decipher. The intended audience is users who have Decipher installed and running and are confident with the functionality. It is suitable for users seeking to implement a proof of concept or to progress to a production implementation.

The guide is broken into the following stages:

Stage

Aim

Team responsible

1. Preparation

Define and create solid document and process requirements.

Automation team and business team.

2. Configuration

Test document form definition (DFD) configuration and make amendments as required.

Automation team.

3. Training

Train Decipher IDP to achieve the target success rate or higher.

Automation team, with business team in support.

4. Testing

Confirm the training is delivering the desired result and agreed targets.

Automation team, with business team in support.

5. Go live

Move to a production environment.

Business team, with automation team in support.

Stage 1: Preparation

During this stage you will be gathering information about the documents you aim to use with Decipher, and the wider process, to build a solid foundation for your Decipher configuration. You can test a couple of sample documents at this stage. These early findings can help with your preparation, or show that the use case is not the right fit.

Documenting your requirements

It is recommended that the key information gathered at this stage is documented for future reference and to track any changes. The following areas are points to consider with your document and process owners, but these criteria can be adapted to fit your use case (a sketch of how they might be captured follows the table):

Area

Information required

Process

  • What is the overarching process?

  • Where do the documents come from?

  • Do they need to be moved or deleted, and is there a data retention period?

  • What is the data classification?

Document

  • Which document fields are to be read, alongside required fields and formats?

  • Do any fields need to be calculated or validated?

  • What is the expected document and page volume?

  • Collect sample documents, checking that:

    • They contain all required fields.

    • No field values are missing.

    • Any consistent field headers are noted, as these can later be used as keyword hints.

  • Are any validation lists available?

  • Manually annotate known document formats to ensure data is identified correctly during training.

  • Allocate sets of training documents and test documents.

Success criteria

  • Agree on targets for success rates for auto-document processing.

  • Discuss reporting requirements.

Strategy

  • Identify roles responsible for manual verification once live.

  • Identify exception scenarios and resolution paths.

  • Define a backup plan and disaster recovery.

  • Decide whether to train all formats, or a sub-set, to gain confidence.

Blue Prism relationship

  • Is the document linked to a wider Blue Prism process?

  • What are the dependencies?
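To keep these requirements trackable over time, they can be captured in a structured, machine-readable form alongside your DFD configuration. The following is a minimal sketch in Python; every field name, format, and value shown is hypothetical and should be replaced with details from your own documents.

```python
# Hypothetical requirements record for an invoice document type.
# All names, formats, and values are illustrative only.
requirements = {
    "process": {
        "source": "shared mailbox",
        "retention_period_days": 90,
        "data_classification": "confidential",
    },
    "document": {
        "expected_monthly_volume": 2000,
        "fields": [
            {"name": "invoice_number", "required": True, "format": r"INV-\d{6}"},
            {"name": "invoice_date", "required": True, "format": "date"},
            {"name": "total_amount", "required": True, "format": "decimal"},
            {"name": "supplier_name", "required": True, "validation_list": "suppliers.csv"},
        ],
    },
    "success_criteria": {"auto_processing_target_pct": 80},
}

# Summarize the fields to be read, with their expected format or list.
for field in requirements["document"]["fields"]:
    print(field["name"], "->", field.get("format") or field.get("validation_list"))
```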

Gathering your sample documents

You will need separate document sets for training and for testing, so you can test that the training has been successful before going live. There are additional considerations if you choose to use machine learning models.
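As a minimal sketch of this allocation, assuming a hypothetical folder structure of samples/<layout>/<file>.pdf, the split could be scripted as follows:

```python
import random
from pathlib import Path

# Allocate sample documents into training and test sets, per layout.
# Assumes a hypothetical folder structure: samples/<layout>/<file>.pdf.
def split_samples(root: str, train_per_layout: int = 5, seed: int = 42):
    rng = random.Random(seed)
    train, test = {}, {}
    for layout_dir in sorted(Path(root).iterdir()):
        if not layout_dir.is_dir():
            continue
        files = sorted(layout_dir.glob("*.pdf"))
        rng.shuffle(files)
        train[layout_dir.name] = files[:train_per_layout]
        test[layout_dir.name] = files[train_per_layout:]
    return train, test

# Usage (assuming the samples folder exists):
# train_set, test_set = split_samples("samples")
```

Keeping the split deterministic (via a fixed seed) makes it easier to reproduce a training run if you later need to start again with clean training data.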

Training stage

Documents required

Notes

Initial training

Up to five documents per layout, with a similar quantity for testing.

  • The initial training stages of Decipher will learn based on the different layouts of your document types. For example, if your document type is an invoice, you will likely receive invoices with differing layouts from each supplier or vendor.

  • It is not necessary to train each layout prior to going live, but you will need to ensure your DFD is configured to successfully take these differences into account.

Machine learning models

1000 documents (default).

  • Training machine learning models requires a greater number of documents. The default is set to 1000, which can be built upon throughout the model's lifetime. Depending on your go-live target, you may be able to delay this stage until your environment has moved to production. This makes the document count easier to achieve, as you will not need to source as many training documents up front.

  • A machine learning capture model is complementary to Decipher's native training model and will only be effective when used in tandem with high quality training data. If you are only achieving an accuracy of 50%, it is unlikely that adding a machine learning model will increase this value. At this stage, it is more valuable to take the time to refine your training data and DFD first. Once these are performing well, enabling the machine learning model will improve overall document confidence as well as field extraction.

Stage 2: Configuration

In this stage you will be testing your DFD configuration and making the changes required to train each document format. Keep the resulting DFD configuration to a minimum, so that it remains broadly applicable to potential document formats. You will need approximately five sample documents for each layout, using a range of layouts and styles to maximize effectiveness. Decipher uses the following elements, known as hints, to generate training data (illustrated in the sketch after this list):

  • Keywords (sample headers)

  • Data types (date, integer)

  • Lists (validation lists, dynamic lists)

  • Format expression (regex)

  • Formulas

  • Location (after at least one document has been trained for the layout)
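To make the format expression hint concrete, the sketch below shows regex patterns of the kind you might attach to DFD fields. The field names and patterns are illustrative only; tailor them to the formats that appear in your own documents.

```python
import re

# Illustrative format expressions (regex) for common field types.
hints = {
    "invoice_number": r"INV-\d{6}",               # e.g. INV-004521
    "invoice_date": r"\d{2}/\d{2}/\d{4}",         # e.g. 31/01/2024
    "total_amount": r"\d{1,3}(?:,\d{3})*\.\d{2}", # e.g. 1,250.00
}

sample_text = "Invoice INV-004521 dated 31/01/2024, total 1,250.00"
for field, pattern in hints.items():
    match = re.search(pattern, sample_text)
    print(field, "->", match.group() if match else "not found")
```

A pattern that is too strict will miss valid values on unseen layouts, so start broad and tighten only where verification shows false matches.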

The table below outlines steps to guide you through testing your DFD. If you encounter unexpected results or challenges, consult the subject matter expert (SME) in the relevant business area.

Step Number

Action

Notes

1

Create a DFD with all known fields.

Keep configuration to a minimum while you learn how the document formats respond to each setting.

2

Create a document and batch type.

Do not enable machine learning at this stage.

3

Upload one sample document.

It is recommended that the document is a good representation of expected documents in terms of image quality and format consistency.

4

Check Data Verification to see which fields are automatically recognized.

Consider:

  • Would changes to sample headers improve success?

  • Are there any incorrectly highlighted regions? Would a DFD update help resolve this?

5

Make DFD changes as required.

If you need to make DFD changes:

  1. Return the batch without saving, to prevent the training data from being updated before it is ready, and make the DFD amendments.

  2. Restart the batch at the Capture stage to update the document verification results with the DFD changes.

  3. Repeat as required to refine your initial results.

If a change does not have the expected impact, you may need to remove it and try another method. Decipher's capture processing is flexible in its design, and there are often multiple ways a solution can be implemented.

6

Manually set the regions as required.

You may need to manually set regions for fields which have not been recognized, and update any that are incorrect.

7

Submit the batch.

This will update the training data.

8

Upload a second sample document.

Use the same format as the first sample document.

9

Check Data Verification to see the level of recall after training.

Consider:

  • Are all fields correctly recognized? If not, try to determine the cause of the issue. This could be the sample header(s), the data format, the image quality, duplicate information, or a lack of visible hints where only location can be used to identify the region.

  • Are any fields showing as low confidence (in red or brown text)? Is there a clear reason for this? Causes could include additional text in the region being excluded by formatting, similar sample headers, a low quality image, or a lack of visible hints.

10

Make further DFD changes as required.

If you need to make DFD changes:

  1. Return the batch without saving, to prevent the training data from being updated before it is ready, and make the DFD amendments.

  2. Restart the batch at the Capture stage to update the document verification results with the DFD changes.

  3. Repeat as required to refine your results.

Be mindful of how your changes may impact the other document formats. A sample header that helps one format may conflict with another. The aim is to add the minimum configuration required to achieve the highest success rate across all sample documents. If format-specific changes are needed, consider using a Specific Version. However, be cautious, as you are unlikely to want hundreds of these for a single DFD.

11

Upload a further three to five samples.

Use the same format as the previous sample documents.

12

Check Data Verification to see the level of recall after training.

If the new samples meet the required success rate, move on to your next sample format.

13

Repeat steps three to twelve for each sample format.

This section is complete when you have tested multiple document formats successfully and it is unlikely that further DFD changes will be required, except where Specific Versions may help.

14

Export your training data and delete it from the Decipher environment.

Export the training data to a folder on your local system for safe keeping.

Stage 3: Training

The aim of this stage is to train Decipher to achieve the target success rate, or higher if possible. It is still recommended to keep the machine learning function disabled at this point, because the volume of documents required to train such a model is likely much higher than your training volume. Also, the machine learning model should be considered supplementary to the training data, as it will only improve results for an already well-trained document type.

Step Number

Action

Notes

1

Split your sample documents into batches.

When creating batches:

  • Try to have multiple samples of each format, but keep in mind that Decipher will continue to learn in production.

  • Start with smaller batches, so that more regions are identified automatically in the larger batches that follow.

  • Mix different formats within your batches.

2

Upload one batch.

Uploading one batch at a time at the beginning of the process assists in monitoring the training and quickly identifying any unexpected results.

3

Verify each batch and record the results.

To record your results, export a Data Capture Accuracy report. A sketch for aggregating these reports across batches follows at the end of this stage.

If your training results are not progressing as expected, check for potential reasons why before moving onto the next stage. Common issues include:

  • Poor image quality

  • Old training data retained in the system

  • Samples not used in the configuration stages

  • Improper manual verification (for example, selecting the wrong region)
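As referenced in step 3, the exported Data Capture Accuracy reports can be aggregated to track progress per batch. The sketch below assumes a CSV export; the column names ("batch", "fields_total", "fields_correct") are hypothetical and should be mapped to the columns in your actual report.

```python
import csv

# Aggregate a hypothetical Data Capture Accuracy CSV export to get a
# field-level accuracy percentage per batch.
def batch_accuracy(report_path: str) -> dict:
    totals = {}
    with open(report_path, newline="") as f:
        for row in csv.DictReader(f):
            t = totals.setdefault(row["batch"], [0, 0])
            t[0] += int(row["fields_total"])
            t[1] += int(row["fields_correct"])
    return {batch: round(100 * correct / total, 1)
            for batch, (total, correct) in totals.items()}

# Usage: print(batch_accuracy("data_capture_accuracy.csv"))
```

Tabulating these per-batch figures makes it easy to see whether accuracy is trending towards your target as training progresses.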

Stage 4: Testing

The testing stage is to confirm that the training is delivering the desired results and that you have achieved the agreed targets to progress to going live.

Split your test documents into batches, starting with smaller batches and increasing to the expected batch size that will be used in production environments. The number of batches, or rounds of testing, required is dependent on the number of test documents available, and the level of confidence required by the process owner. Between three and five batches should be sufficient in most cases, and should cover multiple examples for the majority of known document layouts.
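A minimal sketch of this batching pattern is shown below; the starting and production batch sizes are illustrative and should match your own process.

```python
# Plan test batches of increasing size, up to the expected production
# batch size. Sizes shown are illustrative only.
def plan_batches(documents: list, start_size: int = 5, production_size: int = 50):
    batches, size, i = [], start_size, 0
    while i < len(documents):
        batches.append(documents[i:i + size])
        i += size
        size = min(size * 2, production_size)
    return batches

print([len(b) for b in plan_batches(list(range(120)))])  # [5, 10, 20, 40, 45]
```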

Batches should be uploaded one at a time to assess the quality of extraction. The auto-skip function should be disabled so you can check all results.

Common statistics to assess include the following (a sketch for computing them follows the list):

  • Percentage of documents that had all fields correctly identified, with no validation errors or confidence issues (highlighted with red text). These are documents which would have passed through Decipher automatically with the correct data, without the need for manual intervention.

  • Percentage of documents with incorrectly identified data that was not held for manual processing. This would result in incorrect data being passed to Blue Prism.

  • Percentage of documents with all fields correctly identified, but with some regions flagged with low confidence.

  • Percentage of documents incorrectly held up by validation rules.

  • Percentage of documents with at least one field not identified. This can also be split by the number, or percentage, of fields not identified.

  • Percentage of documents where data was missing, and which would have required manual processing or marking as an exception.
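Once a verification outcome has been recorded for each test document, these percentages are straightforward to compute. In the sketch below, the outcome labels are hypothetical and correspond to the categories listed above:

```python
from collections import Counter

# Hypothetical per-document outcomes recorded during verification.
outcomes = [
    "fully_correct", "fully_correct", "fully_correct",
    "low_confidence", "incorrect_not_held",
    "validation_false_positive", "field_missed", "data_missing",
]

# Percentage of documents in each outcome category.
for outcome, count in Counter(outcomes).most_common():
    print(f"{outcome}: {100 * count / len(outcomes):.1f}%")
```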

Stage 5: Go live

At the start of this process, you defined the target success rates the testing needed to achieve. If these have not been attained, putting your documents into production is not advised. Even with additional machine learning models trained in production, there is no guarantee that the success rate will improve enough to meet your targets.

At this stage, there are three common scenarios:

Success rate

Result

Action

All targets have been achieved.

The Blue Prism process and Decipher document will go live.

Export the batch type (which will include the document types, DFDs and classification model), the training data and the capture model (where applicable).

Some targets have been achieved.

A sub-group of documents will go live as a common layout was successful enough to be moved into production.

  • Separate the documents prior to uploading them into Decipher. The aim is to upload only the documents that are ready for use in production. This is not always possible and depends on how the documents are received. They can be separated manually, or by a Blue Prism process (for example, one that only uploads documents from a specific email address; see the sketch after this table).

  • Note that any changes made to the DFD in the testing stage will require the live documents to be retrained as well.

  • If machine learning is enabled in production, the model will have to be reset or deleted when the remainder of the documents go live. Otherwise, you will have a model that is not based on the new documents and will not efficiently inform decisions on their data extraction.

The targets were not sufficiently achieved.

Documents will not be moved into production.

The training plan should be revised and the DFD amended.
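As mentioned above, documents can be pre-filtered so that only production-ready layouts are uploaded. The sketch below illustrates the idea; the sender addresses and record shape are hypothetical, and in practice this logic would typically be implemented as a Blue Prism process.

```python
# Illustrative pre-filter: only queue documents from senders whose
# layouts met the success targets. All values are hypothetical.
READY_SENDERS = {"billing@supplier-a.example", "accounts@supplier-b.example"}

def ready_for_production(messages: list[dict]) -> list[dict]:
    return [m for m in messages if m["sender"].lower() in READY_SENDERS]

inbox = [
    {"sender": "billing@supplier-a.example", "attachment": "inv1.pdf"},
    {"sender": "unknown@supplier-c.example", "attachment": "inv2.pdf"},
]
print(ready_for_production(inbox))  # only the supplier-a document is queued
```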

Machine learning

Once in production, you may want to enable the additional machine learning capture model; see Machine learning for details. This step is optional if you are already happy with the level of success achieved by your setup. Machine learning will also impact the performance of the capture stage, so you may want to skip it if maximum processing speed is required.

Machine learning training can be carried out as a one-time activity, or at regular intervals driven by a configured document count. Whichever method you choose, the number of documents used to create the model is a key parameter. While it is possible to set the document count as low as ten, this is not recommended: the resulting model would not have seen enough documents to process the majority of yours, and could reduce your success rate. For example, if you have 100 document layouts, a model trained on only ten random documents will have seen at most 10% of your layouts.

Whereas the native training model creates a new entry for each layout (used only for documents of that layout), the machine learning model is not layout specific: training from every layout is applied to all documents. This works well once the model is experienced, but can be detrimental if the training set is too small.

It is recommended to aim for between five and ten times the number of known layouts. For example, if you have 100 layouts, the training size should be between 500 and 1000. The larger the training size, the better the model, and the higher the chance of improving document extraction.
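As a minimal sketch of this rule of thumb:

```python
# Rule of thumb: size the machine learning training set at five to ten
# times the number of known layouts.
def ml_training_range(known_layouts: int) -> tuple[int, int]:
    return known_layouts * 5, known_layouts * 10

print(ml_training_range(100))  # (500, 1000)
```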