Optical character recognition (OCR)

An OCR engine is available within Blue Prism for situations where the native character recognition engine is not suitable for interacting with on-screen text. Common scenarios include applications where smoothed text is enforced, and scanned or otherwise restricted copies of electronic documents.

The Blue Prism capability uses an embedded Tesseract OCR engine to recognise text using pattern matching and complex, language-based text recognition.

To maximise the effectiveness of the recognition, the image should have a resolution of at least 300 dots per inch (dpi). For images such as on-screen text, where the dpi is lower than this, the Scale parameter artificially increases the size of the captured region before passing it to the engine. Setting the scale factor to 4 or 5 generally gives successful results.
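
As a rough illustration of why a scale factor of 4 or 5 tends to work, the sketch below computes the smallest whole-number scale that lifts a capture to 300 dpi. The function name minimum_scale and the example screen resolutions of 96 and 72 dpi are assumptions made for illustration only; they are not part of Blue Prism, and the product may round or apply the Scale parameter differently.

```python
import math


def minimum_scale(source_dpi: float, target_dpi: float = 300.0) -> int:
    """Return the smallest whole-number scale that reaches target_dpi.

    Hypothetical helper: it only illustrates the arithmetic behind the
    recommended scale factor, not how Blue Prism scales the region.
    """
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    # Never scale below 1, even if the source already exceeds the target.
    return max(1, math.ceil(target_dpi / source_dpi))


if __name__ == "__main__":
    # 96 and 72 dpi are assumed, commonly quoted screen resolutions.
    print(minimum_scale(96))  # 4
    print(minimum_scale(72))  # 5
```

Under these assumptions, a 96 dpi capture needs a factor of 4 and a 72 dpi capture needs a factor of 5, which is consistent with the guidance above.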

The OCR engine is used through a Read stage against a previously captured Application Modeller region, and includes options to read text, lists and grids. The pre-processed images can also be output to a specified diagnostics location, to verify that the scaling being applied is sufficient for the selected region.

Language packs

Language packs for use with Tesseract can be obtained from the internet. Blue Prism works with Tesseract version 4.0.0, and it is imperative that language files for the correct major version are used with it. Currently, the version 4.0.0 language files can be downloaded from the Tesseract website.

To add support for another language, download the appropriate files and copy them to the Tesseract\tessdata folder (usually C:\Program Files\Blue Prism Limited\Blue Prism Automate\Tesseract\tessdata).

The language files are prefixed with a language code, e.g. fra (French), deu (German), jpn (Japanese), chi_tra (Traditional Chinese). Once the files are installed on each of the required devices, this code can be specified in the Language parameter of the "Read Text with OCR" action within a Read stage, to instruct the engine to use the required pack.
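
Because the language files must be present on every device that runs the process, it can be useful to check a tessdata folder before relying on a language code. The following is a minimal sketch, assuming the packs use Tesseract's usual <code>.traineddata file naming; the helper names installed_languages and missing_languages are hypothetical and are not part of Blue Prism.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """List the language codes that have a .traineddata file in the folder."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def missing_languages(tessdata_dir, required):
    """Return the required codes that have no matching language file."""
    installed = set(installed_languages(tessdata_dir))
    return [code for code in required if code not in installed]


if __name__ == "__main__":
    # Usual install location quoted above; adjust for the device being checked.
    folder = r"C:\Program Files\Blue Prism Limited\Blue Prism Automate\Tesseract\tessdata"
    print(missing_languages(folder, ["fra", "deu", "jpn"]))
```

An empty list means every requested code has a matching file in the folder. The check does not confirm that a file was built for the correct Tesseract major version.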

Page segmentation mode

The "Read Text with OCR" action within a Read stage has an optional text parameter, Page Segmentation Mode, which allows a Tesseract-defined value to be specified. The values that can be entered in this parameter are shown below, along with a brief description of each.

If no value is entered for the Page Segmentation Mode, then the default value of Auto will be used.

Parameter            Description

OSD                  Orientation and script detection (OSD) only.
AutoWithOSD          Automatic page segmentation with OSD.
AutoNoOCR            Automatic page segmentation, but no OSD or OCR.
Auto                 Fully automatic page segmentation, but no OSD. (Default)
Column               Assume a single column of text of variable sizes.
VerticalBlock        Assume a single uniform block of vertically aligned text.
Block                Assume a single uniform block of text.
Line                 Treat the image as a single text line.
Word                 Treat the image as a single word.
CircledWord          Treat the image as a single word in a circle.
Character            Treat the image as a single character.
SparseText           Find as much text as possible in no particular order.
SparseTextWithOSD    Sparse text with OSD.
RawLine              Treat the image as a single text line, bypassing workarounds that are Tesseract-specific.
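
The values above appear in the same order as Tesseract's numeric page segmentation modes (0 to 13). The sketch below models that correspondence and the default-to-Auto behaviour. It is an illustration only: the numeric mapping is an assumption based on that ordering, and the names PageSegmentationMode and resolve_mode, along with the case-sensitive matching, are hypothetical rather than part of Blue Prism.

```python
from enum import IntEnum


class PageSegmentationMode(IntEnum):
    """Parameter values from the table, numbered in Tesseract's mode order (assumed)."""
    OSD = 0
    AutoWithOSD = 1
    AutoNoOCR = 2
    Auto = 3
    Column = 4
    VerticalBlock = 5
    Block = 6
    Line = 7
    Word = 8
    CircledWord = 9
    Character = 10
    SparseText = 11
    SparseTextWithOSD = 12
    RawLine = 13


def resolve_mode(value=None):
    """Return the mode for a parameter value, falling back to Auto when blank."""
    if value is None or not value.strip():
        return PageSegmentationMode.Auto
    try:
        # Lookup is by exact name; case-insensitive matching is not assumed.
        return PageSegmentationMode[value.strip()]
    except KeyError:
        valid = ", ".join(mode.name for mode in PageSegmentationMode)
        raise ValueError(
            f"Unknown page segmentation mode {value!r}; expected one of: {valid}"
        ) from None


if __name__ == "__main__":
    print(resolve_mode("").name)         # Auto
    print(int(resolve_mode("RawLine")))  # 13
```

A lookup like this can catch a mistyped value before it reaches a Read stage. For example, a single-line field would typically use Line, while a label containing one word would use Word.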

For further information on segmentation modes please consult the official documentation provided by Tesseract on their website.