Document recognition

Sorting

Sorting is the step in the digitization process in which submitted files are compared against all active templates in an account to find the best form-to-template match. It uses the document recognition engine, VISION, to analyze the alignment of distinct visual features and the overall form structure. VISION robustly handles faxed, upside-down, scaled, colored, watermarked, low-DPI, and poor-quality scans.

Sorting is critical for data extraction. Once a form is matched to the closest template, its fields are mapped to the field locations defined on the template. The fields are used to crop the form image into shreds, which are then sent to our READ AI for transcription. Well-configured sorting allows:

  • Inbound pages to have clear matches to the appropriate template.
  • Data to be correctly extracted by READ.

Alignment and sort scores

When a file is processed for sorting, each page receives a sorting score, which is an assigned percentage that measures the alignment accuracy of the form-to-template match. This sorting score is used to determine the quality of the match to a template and is determined by the amount of overlapping features in alignment.

Higher sort scores indicate a more accurate alignment, whereas lower sort scores suggest a less accurate alignment. The sorting threshold is 0.2 (20%) by default but can be adjusted as needed. All image alignments below the sorting threshold are categorized as rejected and receive no template match. These rejected pages will not be digitized.
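
As a rough illustration of the rule above, the following Python sketch applies a sorting threshold to a few sort scores. It is not the VISION implementation. The function name is hypothetical, and the sketch assumes that scores are expressed as fractions between 0 and 1 and that a score exactly equal to the threshold is accepted.

```python
DEFAULT_SORTING_THRESHOLD = 0.2  # default value stated in this document


def is_rejected(sort_score, threshold=DEFAULT_SORTING_THRESHOLD):
    """Return True when a page's sort score falls below the threshold.

    Assumption: "below" is strict, so a score equal to the threshold is accepted.
    """
    return sort_score < threshold


# Sort scores expressed as fractions (for example, 35.20% is 0.352).
for score in (0.9782, 0.352, 0.0956):
    status = "rejected" if is_rejected(score) else "matched"
    print(f"{score:.2%}: {status}")
```

Passing a different value for threshold shows how raising or lowering it changes which pages are rejected.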

Document recognition dashboard

The document recognition dashboard provides an at-a-glance collection of all batches uploaded in shredder. It is used to determine whether the batch of interest is able to match all of its forms to the template(s) that have been set up in the account using the Sort only feature. Once all the forms have been sorted to a template, you can further investigate the form-to-template matches to discover any necessary changes to the template setup or submit the forms for digitization.

The document recognition dashboard lists the uploaded batches in descending order by the last upload creation date. The batches can be searched through in the search bar by batch name or ID and filtered by batch status. The key statuses can include the following:

  • Setup – The batch has been uploaded into the account. It can be sent for sorting only, or digitization.
  • Sorted – All the files of the batch have been sorted against a template and are ready for digitization.
  • Digitized – The batch has been digitized and the data is ready to be reviewed.
  • Processed – The batch has been sorted but the template has been archived.

In the drop-down menu on the far right of each row, options such as Sort Only and Digitize can be applied to the batch as needed. The Sort Only option submits the batch to be matched to a template using the sorting engine, and the Digitize option submits the batch for digitization. The batch can also be renamed, cloned (for example, to run different tests with the same set of forms), or viewed by file in the inbox.

To learn how to investigate your form-to-template matches, see How to understand your sorting results.

Submit a batch for sort only

A batch is submitted for “Sort Only” during Sorting Review to evaluate sorting results and confirm pages are matching to the correct templates. Please be sure that the templates used for sorting review have been made active in the Templates Dashboard.

Create a batch

  1. On the Inbox page, drag and drop files intended for digitization into the inbox or click to browse the files on the computer.

  2. A modal for the batch displays. Enter a name for the batch (refer to Best practices for image files), then click OK.

  3. When all the documents have finished uploading, you can add more forms by clicking Add or complete the batch by clicking Finish.

Do not submit the batch for digitization at this stage.

Once the batch has been uploaded and named, it will appear on the document recognition dashboard.

Submit the batch

  1. Navigate to the Sorting Page where the document recognition dashboard is found. A batch will appear in the setup status upon being uploaded.
  2. After identifying the test batch, select Sort Only to push the batch only for sorting. The batch will then be sorted against all active templates in the account. You can click into the batch ID highlighted in blue to see all the forms uploaded in the batch instead of looking through the inbox.

View sorting results

Please be aware that this view works best on batches that have been submitted for Sort Only. For more information, see Submit a batch for sort only.

Sorted batches should be reviewed to determine whether the template setup can be improved and whether sorting tools are needed to resolve missorts. After a batch has completed sorting, a + button appears next to the row. Clicking the button toggles open all the sorted batches and the single rejected batch below the user-submitted batch.

  • Sorted batches – These batches have successfully matched to an active template and are named “[submitted_batch_name] - [matched_template_name]”. There may be multiple sorted batches if sheets match to multiple templates. The tag on the row is the count of sheets that have matched to the template.
  • Rejected batches – There is only one rejected batch. It contains all the sheets that were rejected because their sorting score fell below the sorting threshold. It is named “[submitted_batch_name] - Rejected”.
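
The grouping and naming described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the tuple layout, and the sample file and template names are hypothetical, and each page is assumed to arrive with its best-matching template and sort score already computed.

```python
from collections import defaultdict


def split_into_batches(batch_name, pages, threshold=0.2):
    """Group pages into sorted batches and a single rejected batch.

    pages: list of (file_name, best_template_name, sort_score) tuples.
    Returns a dict that maps each resulting batch name to its file names.
    """
    batches = defaultdict(list)
    for file_name, template_name, sort_score in pages:
        if sort_score < threshold:
            # All rejected sheets share one batch.
            key = f"{batch_name} - Rejected"
        else:
            # One sorted batch per matched template.
            key = f"{batch_name} - {template_name}"
        batches[key].append(file_name)
    return dict(batches)


pages = [
    ("a.pdf", "Claim Form", 0.91),
    ("b.pdf", "Enrollment", 0.88),
    ("c.pdf", "Claim Form", 0.05),
]
for name, files in split_into_batches("May Intake", pages).items():
    print(name, len(files))
```

The length of each list corresponds to the sheet count shown in the tag on each row.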

The batches can be reviewed by selecting the orange button next to the batch.

The batchgrid page

The batchgrid page previews all the pages sorted into that batch along with their sorting scores. These files can be searched by file name and ordered by sort score, file name, or file size using the Order by drop-down menu.

As each sheet in the batchgrid page is reviewed, it may be useful to order the files by sort score, from low to high or from high to low.

The form to template match page

On the batchgrid page, click the magnifying glass at the top right of a form image to open the side-by-side form-to-template overlay. Each page is available for viewing alongside the template page.

Overlay color codes

The form overlay page displays the submitted form on the left and the template page it matched to on the right. This view allows the user to view the image files side by side and aligned where the template overlays the file image. The color coded alignments suggest how well the form aligns to the template’s form structure.

  • Fluorescent blue – Parts of the form that appear on the template but NOT on the form image
  • Dark orange – Parts of the form image that align perfectly with the template
  • Olive green – Parts of the form image such as handwritten text that do not appear in the template
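
To make the color codes concrete, here is a minimal Python sketch that classifies each cell of two already-aligned black-and-white masks. It only mirrors the three categories listed above; it is not how the overlay is actually rendered. The function names and the mask representation (1 for ink, 0 for background) are assumptions made for this example.

```python
def overlay_color(on_form, on_template):
    """Map one aligned cell to the overlay color described above."""
    if on_form and on_template:
        return "dark orange"       # form and template align
    if on_template:
        return "fluorescent blue"  # on the template but not on the form image
    if on_form:
        return "olive green"       # on the form image only, such as handwriting
    return None                    # background in both images


def overlay(form_mask, template_mask):
    """Classify every cell of two equally sized, already-aligned masks."""
    return [
        [overlay_color(f, t) for f, t in zip(form_row, template_row)]
        for form_row, template_row in zip(form_mask, template_mask)
    ]


form_mask = [[1, 1, 0], [0, 1, 0]]
template_mask = [[1, 0, 0], [0, 1, 1]]
for row in overlay(form_mask, template_mask):
    print(row)
```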

Alignment features

After a batch is sorted, a number of additional tools are available in the Sorting dashboard. These tools include the following:

  • Navigating through pages – The arrow keys can be used to move across the various pages.
  • Compare – The Compare switch adds or removes color-coded form-to-template overlay. The letter O can be used as a keyboard shortcut for this toggle.
  • Select Template Page – The Select Template Page option enables users to see alignments against other versions. Since an image is sorted against every active template in the account, each sheet is assigned a sorting score. The top three sorting score matches appear and, if one is selected, it is overlaid on the form. The letter S can be used as a keyboard shortcut to open the panel.
  • Confirm – The Confirm button verifies that the form-to-template overlay IS a match. A modal appears to validate the confirmation; the user may press Enter to accept it. The letter C can be used as a keyboard shortcut for this action. Since sorting happens automatically during Production, this button is only used for sorting review. It manually adjusts the individual file but does not impact the overall workflow.
  • Junk – The Junk button identifies the form as an incompatible match to all of the sheets in the user’s account. Junked forms are immediately sorted into the rejected batch. A modal appears to validate the confirmation; the user may press Enter to accept it. The letter J can be used as a keyboard shortcut for this action. Since sorting happens automatically during Production, this button is only used for sorting review. It manually adjusts the individual file but does not impact the overall workflow.
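
The "top three" candidates listed by Select Template Page can be pictured as a simple ranking of per-template sort scores. The Python sketch below is illustrative only; the function name, the dictionary layout, and the template names are hypothetical.

```python
def top_template_matches(scores, limit=3):
    """Return the highest-scoring (template, score) pairs, best first.

    scores: dict that maps a template sheet name to its sort score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


scores = {
    "Claim Form v1": 0.62,
    "Claim Form v2": 0.97,
    "Enrollment": 0.11,
    "Claim Form v3": 0.74,
}
for template, score in top_template_matches(scores):
    print(f"{template}: {score:.2%}")
```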

How to understand your sorting results

The sorting review is the process used to identify the configurations required to improve form-to-template matches. It is a helpful confirmation process during implementation to evaluate sorting performance and discover any adjustments or improvements that can be made to the template. From the analysis, reviewers should confirm that in scope forms are being accurately identified, and identify new versions, fields to resize, and missorts (for example, false positives and false negatives). Sorting can be imprecise, so testing and iteration are required per version to set it up correctly.

Please be aware that this view works best on batches that have been submitted for Sort Only. For more information, see Submit a batch for sort only.

Out of scope versus in scope

A page can be considered either in scope or out of scope in regards to sorting. An in scope page refers to a structured form page intended for digitization. These should match to the appropriate templates correctly. An out of scope page refers to a form page or a junk/cruft page that is not intended for digitization.

During sorting, an out of scope page may incorrectly sort to a pre-defined template rather than get rejected. This occurs when the out of scope page is a very close visual match to the in scope template. It is important to identify any potential missorting during the sorting review process.

When should I do a sorting review?

Users should do a sorting review for both blank and sample forms after creating or adding templates into their account.

  • For a batch of blank forms – Especially if the template organization includes multiple versions for template pages, users can check the form-to-template matches to see that all the pages are matching to the correct sheet and whether any fields move across different versions.
  • For a batch of sample production forms – By submitting sample mock forms for sorting, users can get a better idea of how potentially skewed or poorly scanned forms would match to their template. Handwriting on mock forms is also helpful in identifying fields that require resizing adjustments.

If the account already has built-in templates, sorting review is performed during Version Discovery to see if existing templates can capture the new forms.

Users should review the alignment on each page to confirm all fields are being captured as expected and note where there are matches and differences (for example, whether any forms require a new version or have incorrect form-to-template matches). The complexity of sorting review may depend on variables such as form codes, version, and year, which may determine whether a form is in scope or out of scope. Users should look for and consider the following template adjustments to improve form-to-template and field alignment:

Sorting observation: Correct form-to-template matches
  • Description – All in scope forms intended for digitization are built to be accepted for submission and data extraction, and cruft (junk) pages not intended for digitization are sorted away into the reject batch.
  • Solution – No action is needed.

Sorting observation: Field misalignment
  • Description – Fields should be drawn as tightly as possible yet sizable enough to fully capture the data’s location. Viewing the form-to-template alignment overlay will allow users to determine how well the field was drawn to capture data and adjust accordingly.
  • Solution – In the Form-to-Template Match Page, the View DocDef button can be used to access the template to correct the field sizing and placement.

Sorting observation: Incorrectly sorted forms
  • Description – False Rejects: an in scope form page sorted to the rejected batch that should have been matched to an active template. In the event that the sort score is low yet matching to a page, it would be grouped into a different batch. False Positives: an out of scope form page sorted to an active template (an out of scope near duplicate sorted to an in scope form page).
  • Solution – False Rejects can be resolved by adding a blank sheet of the rejected version into the account. If it is in scope and assigned a low sort score, the sheet can be added into the account. If it is in scope and assigned a high sort score, a sorting tool may be needed to resolve the missort. False Positives can be resolved by using sorting tools like the Sort Out template or by raising the sorting threshold.

Sorting observation: New versions
  • Description – These may be forms that were not initially identified as in scope.
  • Solution – If an out of scope version is rejected, no action is needed. If an out of scope version was assigned a low sort score, it can be resolved with a sort out template or by raising the sorting threshold.
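
When tallying review results, it can help to label each page with one of the outcomes described above. The following Python sketch is a hypothetical helper, not part of the product. It assumes the reviewer already knows whether each page is in scope and, optionally, which template it should have matched. The "matched wrong template" label is an extra category introduced for this example.

```python
def classify_outcome(in_scope, matched_template, expected_template=None):
    """Label one reviewed page.

    matched_template is None when the page landed in the rejected batch.
    """
    rejected = matched_template is None
    if in_scope and rejected:
        return "false reject"
    if not in_scope and not rejected:
        return "false positive"
    if not in_scope and rejected:
        return "correct reject"
    if expected_template is not None and matched_template != expected_template:
        return "matched wrong template"
    return "correct match"


print(classify_outcome(True, None))           # in scope page that was rejected
print(classify_outcome(False, "Claim Form"))  # out of scope page that matched
```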

If changes were made as a result of the first iteration of sorting review, Chorus recommends that users verify that the implemented changes have improved the sorting results by resorting the batches.

Below are some examples and analysis on form-to-template matches:

EXAMPLE 1 - Correct Match (Sort Score: 97.82%)

This form-to-template is a perfect match. The fields capture the data and will digitize high quality data.

EXAMPLE 2 - Correct Match (Sort Score: 67.41%)

The top part of the form is cut off. Since there are no fields in that area and the rest of the form captures the information in the fields, this form would be considered a correct match.

EXAMPLE 3 - Correct Match (Sort Score: 35.20%)

This form is a correct match for the template; however, the form was submitted with the top right cut off. This would result in partial digitization of the fields that have aligned in the scanned image.

EXAMPLE 4 - Incorrect Match (Sort Score: 9.56%)

The form is entirely different from the template and nothing on the form aligns well. This would likely have a very low sort score, and would likely be rejected.

Sorting tools

There are a variety of sorting tools to guide the document recognition engine to sort the submitted forms to the correct template. Users should avoid using sorting tools unless a missort has been observed during a thorough sorting review. If the fields align and the forms are matching to the correct template at a high sort score, the application of sorting tools is not recommended. To determine the best sorting tool to resolve a missort, please use the table below:

Sorting tool: Exclusion region
  • Description – It outlines an area that should be ignored during alignment and scoring. It is good for logos or other areas that are common across forms, but not helpful in discriminating between forms.
  • User access – Users must request that regions be enabled for their account. Once an account has access to regions, regions can be applied on an as-needed basis across all templates in the account.

Sorting tool: Simple region
  • Description – This narrows the system’s focus for alignment so anything outside a simple region will be blanked out during shredding. Any area outside the simple region(s) is “cropped” or completely ignored during alignment, scoring, and shredding.

Sorting tool: Region of difference
  • Description – This tool outlines an area that should be focused on as a data point during alignment and scoring to differentiate from other versions. It is good for nearly identical sheets with less noticeable changes when discriminating between forms.

Sorting tool: OCR region
  • Description – It is used to identify specific, prominent features of a form for the purpose of identifying it correctly. It is drawn over a form label, form code, or other snippet of text; READ reads the digitized value and adjusts the sort score using a value supplied by the user.

Sorting tool: Sort out template
  • Description – A sort out template is used when out of scope forms are matching to in scope forms. By creating a separate template for this out of scope form, it ensures that the out of scope form will have a closer match that will have a higher sort score than the in scope template, and the Sort Out page will be treated as a “rejected” page.
  • User access – Users can create a Sort out template as needed; however, they should notify an SS&C Chorus team member to configure the returned output to exclude the data digitized from the sort out template.

Exclusion regions

Regions must be enabled for the account before they can be used.

This tool outlines an area that should be ignored during alignment and scoring. It is good for logos or other areas that are common across forms, but not helpful in discriminating between forms.

Exclusion regions are applied on a per sheet basis. Unlike field behavior across versions, regions will not duplicate into other versions in the template page. This means a new region must be created for each sheet where it is needed. For the best results, a maximum of two exclusion regions should be used. The application of an exclusion region where it is not needed may be more harmful than helpful.

Exclusion regions should be used:

  • Over logos or images which may be consistent across multiple pages.
  • When there is extraneous text or titles that are repeated on multiple template pages.

Apply an exclusion region

  1. Enter the document definition page to view your template. Navigate to the specific sheet that requires an exclusion region.
  2. Click the Regions icon.

  3. Click the + Add Region button.

    The region will appear on the image.

  4. Click the Exclusion option to set the region as an exclusion region.

  5. Enter a name for the region (optional).

  6. Move the region over the desired location and resize the region as tightly as possible.

  7. Repeat steps 1-6 for every sheet in the template where required.

Form to template alignment example

The region of exclusion helps the document recognition engine ignore distracting form features. In this example, an older version contains data that could be captured despite a slight difference in form design (i.e., an updated logo). When that older version is submitted without the region of exclusion, the title is used as the key form feature that the engine uses to align the forms. Notice that the significant misalignments result in a low sort score. This means that the form will be rejected and will not be digitized. This can be resolved using the region of exclusion.

After the region of exclusion is applied on the template and the form is sorted once more, notice that the area where the region of exclusion has been placed on the template (right) is the area that the document recognition engine "ignores" on the submitted form (left). Now the older version of the form aligns well to the newest version. The fields align, the sort score has increased to above the default sorting threshold (20%), and this older version will now be digitized.
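
The effect described in this example can be imitated with a toy model. The Python sketch below is not the VISION algorithm: it assumes, purely for illustration, that a page is a set of inked grid cells, that the form and template are already aligned, and that the score is the share of cells the two have in common. All names and coordinates are hypothetical. It shows why ignoring an area that differs between versions (such as a changed logo) can raise the score.

```python
def in_region(cell, region):
    """region is (x0, y0, x1, y1), inclusive, in grid coordinates."""
    x, y = cell
    x0, y0, x1, y1 = region
    return x0 <= x <= x1 and y0 <= y <= y1


def overlap_score(form_cells, template_cells, exclusion_regions=()):
    """Toy score: shared cells divided by all cells, ignoring excluded areas."""
    def keep(cells):
        return {
            cell for cell in cells
            if not any(in_region(cell, region) for region in exclusion_regions)
        }

    form, template = keep(form_cells), keep(template_cells)
    union = form | template
    if not union:
        return 0.0
    return len(form & template) / len(union)


body = {(x, 5) for x in range(4)}    # content shared by both versions
template = body | {(0, 0), (1, 0)}   # newest version, with the new logo
form = body | {(0, 1), (1, 1)}       # older version, with the old logo
logo_area = (0, 0, 1, 1)

print(overlap_score(form, template))               # logo cells count against the match
print(overlap_score(form, template, [logo_area]))  # logo area is ignored
```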

Simple regions

Regions must be enabled for the account before they can be used.

When drawn on a sheet, a simple region narrows the system’s focus for alignment so anything outside a simple region will be blanked out during shredding. Any area outside the simple region(s) is “cropped” or completely ignored during alignment, scoring, and shredding. A simple region is not necessary if all templates are visually distinct.

Simple regions are applied on a per sheet basis. Unlike field behavior across versions, regions will not duplicate into other versions in the template page. This means a new region must be created for each sheet where it is needed. It is important to note that only one simple region should be placed on a single sheet at a time.

Simple regions should be used when:

  • There is extraneous information such as page borders, for example, on a death certificate.
  • Sections of information must be captured on in scope pages which move across different versions.

Apply a simple region

  1. Enter the document definition page to view your template. Navigate to the specific sheet that is in need of a simple region.
  2. Click the Regions icon.

  3. Click the + Add Region button.

    The region will appear on the image.

  4. Select the Simple option to set the region to a simple region.

  5. Enter a name for the region (optional).
  6. Move the region over the desired location and resize the region as tightly as possible.

  7. Repeat steps 1-6 for every sheet in the template where required.

Regions of difference

Regions must be enabled for the account before they can be used.

This tool outlines an area that should be focused on as a data point during alignment and scoring to differentiate from other versions. It is useful for nearly identical sheets with less noticeable changes when discriminating between forms.

Regions of difference are applied on a per sheet basis. Unlike field behavior across versions, regions will not duplicate into other versions in the template page. This means a new region must be created for each sheet where it is needed. For the best results, a maximum of one region of difference should be used. The application of a region of difference where it is not needed may be more harmful than helpful.

Regions of difference should be used on pages that missort as a result of a minor form change, for example, a change to a form field name.

Apply a region of difference

  1. Enter the document definition page to view your template. Navigate to the specific sheet that requires a region of difference.
  2. Click the Regions icon.

  3. Click the + Add Region button.

    The region will appear on the image.

  4. Select the Difference option to set the region to a region of difference.

  5. Move the region over the desired location and resize the region as tightly as possible.

  6. Enter a name for the region (optional).

Form to template alignment example

The region of difference helps the document recognition engine focus on minute form design changes. In this example, another in scope form contains a different form field name and field to capture different information. Notice the high sort score (98.33%) despite the field differences (first and last name versus full name) in Section 1. This means that the full name field of the matched page will not be able to correctly capture the first and last name that was originally separated by form design in the submitted form. This can be resolved using the region of difference.

After applying the region of difference over the form design difference, the incorrect page is now assigned a lower sort score (74.14%) and the form will be able to match the correct template page at a higher sort score (96.75%).

OCR regions

OCR regions are used to identify specific, prominent features of a form for the purpose of identifying it correctly. An OCR region is typically drawn over a form label, form code, or other snippet of text; READ reads the digitized value and adjusts the sort score using a value supplied by the user.

  • Text matching can be set to: equals, contains, or does not contain.
  • Prior to text matching, a Regex can be applied to the digitized value.
  • An Edit Tolerance can be set to define the number of character differences between the supplied text and the OCR read value.
  • The weight* of this field is set to a number between 0 and 1. This weight will be added to the overall sort score in the event of an OCR match, and subtracted from the overall sort score if no valid match is found within the Edit Tolerance.

*If multiple OCR regions are placed on the same template sheet, only one OCR region will contribute to the final sort score:

  1. If at least one match is found: the highest weighted positive match will be added to the score.
  2. If no match is found: the highest weighted non-match will be subtracted from the score.
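
The weighting rules above can be sketched in Python as follows. This is a simplified illustration, not the product's implementation. It covers only the "equals" matching mode, assumes the Edit Tolerance is measured as Levenshtein distance, skips the optional Regex step, and does not clamp the adjusted score. The class name, function names, and sample text are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


class OcrRegion:
    """One OCR region: user-supplied text, a weight, and an edit tolerance."""

    def __init__(self, expected_text, weight, edit_tolerance=0):
        self.expected_text = expected_text
        self.weight = weight  # between 0 and 1
        self.edit_tolerance = edit_tolerance

    def matches(self, ocr_value):
        """'Equals' matching within the allowed number of character differences."""
        return edit_distance(self.expected_text, ocr_value) <= self.edit_tolerance


def adjust_sort_score(base_score, readings):
    """readings: list of (OcrRegion, ocr_value) pairs for one template sheet."""
    matched = [region.weight for region, value in readings if region.matches(value)]
    unmatched = [region.weight for region, value in readings if not region.matches(value)]
    if matched:
        return base_score + max(matched)    # highest weighted positive match
    if unmatched:
        return base_score - max(unmatched)  # highest weighted non-match
    return base_score                       # no OCR regions on this sheet


form_code = OcrRegion("FORM A-100", weight=0.3, edit_tolerance=1)
year = OcrRegion("2023", weight=0.5)
print(adjust_sort_score(0.4, [(form_code, "FORM A-1O0"), (year, "2022")]))
```

In the usage line, the form code is read with one wrong character, which is within its tolerance, so only the form code's weight is added even though the year region did not match.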

OCR regions should be used when:

  • There is extraneous information such as page borders, for example, on a death certificate.
  • Sections of information must be captured on in scope pages which move across different versions.

Since OCR regions are highly configurable and their influence on the sort score can be difficult to understand, they should only be used after all other Sorting Tools have been tried. If you have identified an instance where an OCR region may be needed to differentiate forms, please contact an SS&C Chorus team member.

Sort out templates

A sort out template is used when out of scope forms are matching to in scope forms. By creating a separate template for this out of scope form, it ensures that the out of scope form will have a closer match that will have a higher sort score than the in scope template, and the Sort Out page will be treated as a “rejected” page.

Sort out templates should be used when out of scope pages consistently missort to in scope pages with a high sort score. Inserting a sort out template will cause those out of scope pages to match the sort out template with an even higher sort score.
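
The behavior described above can be pictured with a short Python sketch. It is an illustration under stated assumptions rather than the actual engine: the function name, the set of sort out template names, and the sample scores are hypothetical, and the page is assumed to have one sort score per active template.

```python
def resolve_match(scores, sort_out_templates, threshold=0.2):
    """Return the accepted template name, or None when the page is rejected.

    scores: dict that maps a template name to the page's sort score.
    sort_out_templates: names of templates that exist only to catch
    out of scope pages.
    """
    if not scores:
        return None
    best_template, best_score = max(scores.items(), key=lambda item: item[1])
    if best_score < threshold:
        return None  # below the sorting threshold
    if best_template in sort_out_templates:
        return None  # closest match is a sort out template: treat as rejected
    return best_template


scores = {"Claim Form": 0.81, "Claim Cover Letter (sort out)": 0.93}
print(resolve_match(scores, {"Claim Cover Letter (sort out)"}))
print(resolve_match({"Claim Form": 0.81}, set()))
```

Without the sort out template, the same out of scope page would have been accepted by the in scope template at its high sort score.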

Sorting threshold

A sorting threshold is the level at which pages are accepted for digitization. It filters out forms that are unlikely matches to the template and allows forms with more accurate alignments to be digitized. All users are set at a standard threshold of 0.20 by default. This means that any page with a sort score below 20% is categorized into the rejected batch. The threshold can be customized at the user level or at the sheet (version) level to reject any pages with a score lower than your specified value. If your account requires a specific sorting threshold, please contact Support to coordinate sorting reviews with different sorting thresholds.

Why does the threshold matter?

Sorting can be imprecise, so testing and iteration are required per version to set the threshold correctly.

  • Raising the threshold will require more exact alignments to match.
  • Lowering the threshold will require less exact alignments to match.

How to determine a sorting threshold

By conducting a sorting review, users may discover sorting anomalies such as low match scores and incorrectly matched sheets with high sort scores. The optimal sorting threshold should be identified after a sorting review with a sufficient set of production-level forms, so that the sorting scores are as representative as possible.

Raising the threshold

PROs
  • Increase accuracy of sorting by lowering risk of missorts. (ex. near duplicates of form versions)
  • Decrease blank/impossible data values caused by missorts
CONs
  • More pages are rejected, some of which may include falsely rejected pages that contain desired data.
  • Initial data set may not be representative, sort scores can be unreliable and adjusting thresholds may result in unintended consequences

Lowering the threshold

PROs
  • Capture more poor quality scans of form pages with transcribable data
CONs
  • Increased rate of false positives/matches to templates (ex. junk/cruft, cover pages, etc.)
  • Illegible data (sorted properly but unclear writing)
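
One way to weigh these trade-offs is to count, for a set of reviewed pages, how many false rejects and false positives each candidate threshold would produce. The Python sketch below is a hypothetical helper, not a product feature. It assumes each reviewed page is recorded as its best sort score plus a reviewer-assigned in scope flag, and the sample data is invented for the example.

```python
def sweep_thresholds(pages, thresholds):
    """Count missorts for each candidate threshold.

    pages: list of (sort_score, in_scope) pairs from a sorting review.
    Returns {threshold: (false_rejects, false_positives)}.
    """
    results = {}
    for threshold in thresholds:
        false_rejects = sum(
            1 for score, in_scope in pages if in_scope and score < threshold
        )
        false_positives = sum(
            1 for score, in_scope in pages if not in_scope and score >= threshold
        )
        results[threshold] = (false_rejects, false_positives)
    return results


reviewed = [(0.9782, True), (0.6741, True), (0.352, True), (0.0956, False), (0.28, False)]
for threshold, (rejects, positives) in sweep_thresholds(reviewed, [0.2, 0.3, 0.4]).items():
    print(f"threshold {threshold}: {rejects} false rejects, {positives} false positives")
```

A small sample like this one can be misleading, which is why a representative set of production-level forms matters before changing a threshold.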

Thresholds can also be adjusted at the sheet level. If a user discovers many false positives or false negatives, a custom sorting threshold can be set at the sheet level (independent of the user-level sorting threshold) to prevent data loss. This can also be applied where an older version among newer versions is being accepted at different low match scores. Sometimes, adjusting the sorting threshold at the user level will cause this issue with other templates, and a custom sorting threshold at the sheet level will help distinguish the submitted forms.
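
The relationship between the two levels can be sketched in Python as a simple lookup with a fallback. This is only an illustration: the function names, the sheet identifiers, and the dictionary of overrides are hypothetical, and it assumes that a sheet-level value, when present, replaces the user-level value for that sheet.

```python
USER_LEVEL_THRESHOLD = 0.2  # default stated in this document


def effective_threshold(sheet_id, sheet_thresholds, user_threshold=USER_LEVEL_THRESHOLD):
    """Use the sheet-level threshold when one is set, else the user-level one."""
    return sheet_thresholds.get(sheet_id, user_threshold)


def is_accepted(sheet_id, sort_score, sheet_thresholds, user_threshold=USER_LEVEL_THRESHOLD):
    """Return True when the score meets the threshold that applies to the sheet."""
    return sort_score >= effective_threshold(sheet_id, sheet_thresholds, user_threshold)


# Hypothetical override: one older version needs a stricter threshold.
sheet_thresholds = {"claim_form_2019_p1": 0.45}

print(is_accepted("claim_form_2019_p1", 0.35, sheet_thresholds))  # stricter sheet-level value applies
print(is_accepted("claim_form_2023_p1", 0.35, sheet_thresholds))  # falls back to the user-level value
```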

If you have identified an instance where a change in the sorting threshold may be needed to filter in/out forms, please contact an SS&C Chorus team member.