Version Discovery

What is Version Discovery?

In the Scoping phase of the project timeline, Version Discovery is the process used to determine the optimal number and organization of sheets to upload into the Document Automation system. Forms are typically developed as a set, with corresponding version types. Versions that are visually close in structure do not necessarily require their own template sheet. The Cluster Tool identifies which form pages the Document Automation system treats as visually similar, and so determines how many template sheets need to be built. This streamlines the template build during implementation and results in effective form-to-template matching when digitizing forms.

The Version Discovery timeline

To complete the process, provide all in-scope blank forms to the Document Automation team, who will process them in the internal, staff-facing Cluster Tool. Upon receipt, the Document Automation team will return the Cluster Tool results within 1-2 business days. Version Discovery becomes more complex as the number of form types and versions grows.

What is a version?

A version is any individual sheet that represents one possible visual variation of a given template page. Rather than classifying forms by attributes such as state and year, Document Automation’s sorting engine relies on form structure and design to identify in-scope forms to digitize.

Templates are typically built with versions to encapsulate all visual variations in a set of in-scope forms. Only visually unique and visually similar (but not duplicate) form pages are required for upload. Uploading near duplicate pages into the account will cause complications during the sorting process. During Version Discovery, each variation of field and template structure is identified to ensure accurate data capture.

Versions in a template typically have the same or similar fields, so fields may be reused across versions. Once fields are placed on one version of a template page, they appear in the same location on all versions of that page. On each version, those fields then need to be moved, resized, or made inactive as needed.

When should a version be created?

A version should be created if a sheet has any machine-recognizable visual differences, such as changes to form text, spacing, font, or additional fields.

Only visually unique versions should be uploaded as a template page. A visually unique page does not share the same form structure or data fields as another page.

Types of sheets

Types of sheets table

  • Identical Sheet – Forms with exactly the same data fields and visual layout. Only one of the two sheets should be uploaded to represent both, since VISION will identify them as identical.
  • Near Duplicate – Forms that have similar data fields and are visually similar to one another. Only one of the two form pages is uploaded, as a single version that represents both.
  • Unique Sheet – Forms that have the same data fields as another form but are visually different. These two sheets should be uploaded as two separate versions.

What is the Cluster Tool?

Rather than relying on the human eye to judge visual nuances, Document Automation’s Cluster Tool uses our sorting engine to analyze submitted pages and aggregate near duplicates into clusters. Each cluster is returned as a folder containing either a group of near duplicates or a single unique sheet. Each folder therefore represents a single sheet that should be uploaded into the Document Automation system, either as a template page or as a version of a template page. For example, the Cluster Tool results would return a series of folders like the following:

Cluster tool representation

This tool also produces a report that summarizes the clusters created from all the submitted forms.

Cluster Tool submission guidelines

  • Forms submitted to the Cluster Tool must be clean blank forms for the best results.
  • You can submit a maximum of 3,000 blank forms for discovery.
  • All the forms that are in-scope for a single workflow should be submitted to the Cluster Tool so each form page is measured against all others for unique form structure.
  • Form file names should be descriptive and include the form name and version (e.g. Claimant_form-2017-verA.pdf).
  • The Version Discovery Tool is staff facing. You must request processing and provide the forms to be run through the Cluster Tool.
  • The files submitted to the Cluster Tool can be in PNG, TIFF, GIF, JPG, or PDF file format (DOC or DOCX file formats are not compatible).
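The guidelines above can be checked before forms are sent to the Document Automation team. The following is a minimal sketch, not part of the Cluster Tool itself: the function name check_submission is hypothetical, the ".tif" and ".jpeg" spellings are assumed to be acceptable variants of TIFF and JPG, and the 3,000-form limit is assumed to apply per file.

```python
from pathlib import PurePath

# Formats listed in the submission guidelines. ".tif" and ".jpeg" are
# assumed spellings of TIFF and JPG, not confirmed by the guidelines.
ALLOWED_EXTENSIONS = {".png", ".tif", ".tiff", ".gif", ".jpg", ".jpeg", ".pdf"}

# Maximum number of blank forms per discovery request
# (assumption: one file per form).
MAX_FORMS = 3000


def check_submission(file_names):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if len(file_names) > MAX_FORMS:
        problems.append(
            f"{len(file_names)} files exceeds the {MAX_FORMS}-form limit"
        )
    for name in file_names:
        extension = PurePath(name).suffix.lower()
        if extension not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported file format '{extension}'")
    return problems
```

A batch containing a DOCX file would be reported as a problem, while Claimant_form-2017-verA.pdf would pass. The check covers only file count and format; whether a form is clean and blank still has to be confirmed by eye.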

Assessing Cluster Tool results

The Cluster Tool returns a summary report alongside a group of folders, each containing one cluster of near duplicate pages or a single unique page. These results must be organized and reviewed to determine how sheets will be uploaded into the Document Automation system.

Each folder represents a single sheet to be uploaded into the system. For example, suppose page one of Forms CF-A, CF-B, and CF-D is submitted to the Cluster Tool, which returns two folders. In Folder 0, Forms CF-A and CF-B are near duplicates, so only one of the two form pages needs to be uploaded as a version. In Folder 1, Form CF-D is grouped separately because its form structure differs from CF-A and CF-B. Only one sheet from each cluster (folder) should be uploaded into the system. The result is a single template with two template sheets.
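The one-sheet-per-folder rule can be sketched in a few lines. This is an illustration only: the file names are hypothetical stand-ins for page one of Forms CF-A, CF-B, and CF-D, and choosing the alphabetically first file as the representative is an assumption, since any one page of a cluster can represent it.

```python
def sheets_to_upload(clusters):
    """Pick one representative file per folder (here: the alphabetically first)."""
    return {
        folder: sorted(files)[0]
        for folder, files in clusters.items()
        if files  # skip empty folders defensively
    }


# Hypothetical Cluster Tool output for the example above.
clusters = {
    "0": ["CF-A_p1.pdf", "CF-B_p1.pdf"],  # near duplicates
    "1": ["CF-D_p1.pdf"],                 # unique sheet
}

upload_list = sheets_to_upload(clusters)
# Two folders, so two template sheets are uploaded.
print(len(upload_list), upload_list)
```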

Stitching sheets into a Template

Because all in-scope forms must be compared against each other, it is common for the pages of multi-page forms to vary in how similar they are across version types. In the example below, there are two versions of a form, and not every page of both forms needs to be uploaded into the template. Some pages in a set must be uploaded as distinct versions of a template page, while other pages are near duplicates that are covered by a single uploaded sheet.

In this scenario, there are two versions of Form CF. After the forms are run through the Version Discovery Tool, the results show that Pages 1 and 3 are near duplicates across the two versions, while the second pages are unique sheets.

As a result, the uploaded template will have 3 pages, with the sheets uploaded as follows:

Uploaded template sheets

As the number of forms increases, it becomes more helpful to use the folder_prefixes spreadsheet to determine how to organize sheets into the template.

Understanding the folder_prefixes spreadsheet

A spreadsheet named “folder_prefixes” is provided with the Cluster Tool results to help analyze the output across a large set of forms. The folder_prefixes spreadsheet shows a summary of the folder results with the following columns:

  • Folder Name – The folders are numbered starting from 0 and increase depending on the number of clusters determined by the tool.
  • Number of Files – The number of pages in the cluster. Folders are sorted by this count in descending order: Folder 0 (zero) holds the largest visual cluster, and cluster size decreases as the folder number increases. A folder with a single file represents a unique sheet to be uploaded.
  • Longest Common Prefix – If all the forms in a cluster share a common file name prefix, it may indicate a clear grouping of forms, which makes it easier to organize a template when dealing with multiple form types.
  • First File – The name of the first file that appears in the folder. Since only one sheet should be uploaded to represent a cluster, the First File column effectively provides a file upload list.

Folder name   Number of files   Longest common prefix   First file
0             2                 form_cf                 form_cf_a_p1.png
1             2                 form_cf                 form_cf_a_p3.png
2             1                 form_cf_a               form_cf_a_p2.png
3             1                 form_cf_b               form_cf_b_p2.png

The folder_prefixes spreadsheet can be used as an overview of the entire form set, which can then be used to create a Version Discovery Report.
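To make the columns concrete, the sketch below derives the same kind of summary from a list of clusters. It is not the Cluster Tool's actual implementation. It assumes file names follow a `<form>_<version>_p<page>.<ext>` pattern with underscore-separated tokens, and it assumes the prefix is compared token by token with the page token dropped, because that reproduces the prefixes in the example table. The function names are hypothetical.

```python
from pathlib import PurePath


def name_tokens(file_name):
    """Split 'form_cf_a_p1.png' into ['form', 'cf', 'a'].

    Drops the extension and the trailing page token (assumed naming pattern).
    """
    return PurePath(file_name).stem.split("_")[:-1]


def longest_common_prefix(file_names):
    """Join the leading tokens shared by every file name in the cluster."""
    prefix = []
    for column in zip(*(name_tokens(name) for name in file_names)):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return "_".join(prefix)


def summarize_clusters(clusters):
    """Build folder_prefixes-style rows, largest cluster first."""
    rows = []
    ordered = sorted(clusters, key=len, reverse=True)
    for folder_number, files in enumerate(ordered):
        files = sorted(files)
        rows.append({
            "folder_name": folder_number,
            "number_of_files": len(files),
            "longest_common_prefix": longest_common_prefix(files),
            "first_file": files[0],
        })
    return rows
```

If pages 1 and 3 of two hypothetical form versions (form_cf_a and form_cf_b) cluster together and the two page 2 files each form their own cluster, summarize_clusters returns the four rows of the example table.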

Create a Version Discovery report

A Version Discovery Report details how all sheets are arranged into specific template uploads. The folder_prefixes spreadsheet shows which sheet to upload from each group of near duplicates, but users must still determine which template page each sheet belongs to. All the form pages must be stitched together into a template. For example, the above folder_prefixes report can be used to create the following Version Discovery report:

Template name   Template page   Sheet uploaded     # Sheets
form_cf         p1              form_cf_a_p1.png   1
                p2              form_cf_a_p2.png   2
                                form_cf_b_p2.png
                p3              form_cf_a_p3.png   1
Total Sheets                                       4

The longest common prefixes help identify how the template is organized by showing which template each first file belongs to. In the example table (above), all of the form pages have a common prefix of “form_cf”, so it is clear that all four files belong to Form CF. When files are named consistently by form name and version, the sheets are easy to map onto template pages.
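The mapping from first files to template pages can be sketched as follows. This is an illustration under assumptions, not a Document Automation feature: it assumes file names follow a `<form>_<version>_p<page>.<ext>` pattern, as in the example table, and the function name build_report is hypothetical.

```python
import re
from collections import defaultdict
from pathlib import PurePath

# Assumed naming pattern: <form>_<version>_p<page>, e.g. form_cf_a_p1
NAME_PATTERN = re.compile(r"^(?P<form>.+)_(?P<version>[^_]+)_p(?P<page>\d+)$")


def build_report(first_files):
    """Group the files to upload by template name, then by template page."""
    grouped = defaultdict(lambda: defaultdict(list))
    for file_name in first_files:
        match = NAME_PATTERN.match(PurePath(file_name).stem)
        if match is None:
            raise ValueError(f"Cannot parse form name and page from {file_name!r}")
        grouped[match["form"]][int(match["page"])].append(file_name)
    # Convert to plain dicts with pages in ascending order.
    return {
        form: {page: sorted(files) for page, files in sorted(pages.items())}
        for form, pages in grouped.items()
    }


def total_sheets(report):
    """Count every sheet listed across all templates and pages."""
    return sum(len(files) for pages in report.values() for files in pages.values())
```

Passing the four first files from the example folder_prefixes table yields one template, form_cf, with one sheet on page 1, two on page 2, and one on page 3, for a total of four sheets. File names that do not follow the assumed pattern raise an error rather than being placed silently, so they can be assigned to a template page by hand.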

Once this Version Discovery Report is created, the relevant template pages can be loaded into the Document Automation system for template creation.