ETL in the Cloud operations

What is ETL?

ETL (Extract Transform Load) is a data-warehousing term for working, transforming, and finally loading data.

In Blue Prism Process Intelligence, ETL is a feature where a user can upload small or large files or multiple files, zipped or unzipped, and once uploaded they can work with the uploaded file as they would browse a database table. That means the user can run operations on the uploaded file. For example, joining two columns into one or trimming. After transforming the data the user can choose to load the transformed table into a project or multiple projects. ETL is a feature for advanced and big-data uploads. This is because compared to the regular file upload, type detection is not run automatically and timelines are not created from the data, so the user needs to understand the data. Users can upload compressed or big data files and then work on the raw data before creating timelines within a project.

Repository

A repository is an abstract container which is similar to a project. Process Intelligence holds tables in a repository, where each table represents one or more data files which the user uploaded. A user is able to have more than one repository and each repository may have multiple tables. The user can switch between repositories, and a repository could have more users. A repository owner can invite users to collaborate on a repository. For more information, see Getting started with Process Intelligence repository.

To open the repository, click View > Repository.

Tables

A repository table is a database table. In the most simple case a user uploads a CSV file into a repository, and Process Intelligence creates a database table with the contents of the CSV file. It then becomes a repository table. To switch to table view, click the Tables option in the Repository menu, then select the table you require. You can add different operations for each table by clicking Add operation and selecting an operation from the list. 

To configure the operation, specify the necessary parameters. You can add some explanatory text for your operation, which will replace the default description. In the figure below, a user has added custom explanatory text for 2. Delete column.

When you finish setting up the operation you may enable a preview of the operation. Process Intelligence will then run that operation on a temporary data table. The original data table is retained, but you will see a temporary data table which is only 1000 lines long. The operation previews on this temporary table and it resets after every operation preview.

To run an operation, click an operation in the Operations pane and then click one of the following actions:

  • – Runs the selected operation.
  • – Runs all the operations, from first to last.
  • – Moves the selected operation one step back or forward.

The operations are numbered and will be run consecutively in the order in which they appear in the Operations pane. For example, you will not be able to replace a substring in certain columns if these columns have already been deleted by a preceding operation.

Operations

You can perform various operations on tables in your repositories before loading their data into a project. For example, you may wish to delete redundant columns, format some fields, or modify some field values, or you may choose to load only data related to a particular department or employee.

The operations that you have chosen to be performed on a particular table will be displayed in the Operations pane.

  • Change case – Converts the field value to upper or lower case. For strings only.
  • Change type – Converts the selected field into the specified type.
  • Combine timestamp – Creates a Timestamp field from separate Date and Time fields.
  • Create timestamp – Creates a Timestamp field from a text field using format expression.
  • Date add – Adds or subtracts a date part from a date.
  • Date diff – Creates a new field by calculating the difference between two dates.
  • Delete records – Deletes records based on one or all criteria.
  • Delete by timelines – Deletes timelines based on a criterion.
  • Delete column – Deletes selected columns from a table.
  • Delete duplicates – Deletes duplicate records.
  • Derive field – Creates a new field by combining several fields and the fixed text.
  • Extract substring – Extracts the substring from current substring. For strings only.
  • Join – Adds field from another (child) table to parent table.
  • Load into project – Loads the timelines to a new or existing project
  • Remove substring – Removes the specified substring from the field value. For strings only.
  • Replace substring – Replaces the specified substring with another substring. For strings only.
  • Round timestamp – Rounds the timestamp field to the specific units (second, minute, etc.).
  • Transpose – Converts single record with multiple selected fields into multiple records.
  • Trim – Removes extra spaces on left and right.
Substrings

Some operations can only be performed on substrings. A substring is a sequence of characters of any length within a larger string. The table below shows a string and its possible substrings.

String Possible substrings
Raleigh, North Carolina

Raleigh

North Carolina

North Car

gh, N

R

As part of the T (transform) in ETL the user can perform various transformations on the data table. A common need is to concatenate two columns into one single column. For example, concatenating a column that has data for a Date without Time with a column which contains a Time. The resulting column is a DateTime field type which is a required data field by Process Intelligence.

To-do list

After you made some operations on the data table, you can choose to save the sequence of those operations as a to-do list for later use on the same file or others which require the same operations. Click Save as to-do list in the Operations pane.

You may upload:

  • A single CSV file
  • Multiple CSV files zipped
  • A single CSV file zipped

Role management

A repository may have any number of users and a user could be a part of multiple repositories.

Users' rights in a repository could be one of the following:

  • Data manager – Can view existing data and upload new data.
  • Admin – Can view existing data and upload new data, as well as add or remove other users and change their rights (except for Owner).
  • Owner – The person who created the project. A project can only have one owner at a time. The ownership can be transferred to another user. An owner has full Admin rights. If the current owner changes their role to any other, they will be prompted to specify the user to transfer ownership of the repository.

Any admin or owner can make another user an admin or remove the admin right from them. Admin rights cannot be removed from the owner. An admin can change the rights of other users by clicking in the drop-down lists in the user grid.

Admin can also add another user to the project by typing their email and clicking Enter. A new user is then created with the Data Manager rights.

If the user already has an account in Blue Prism Process Intelligence, they will receive an email notification about access to a new project.

If the user doesn’t have an account, a new account is created in Recovery status, then the user is sent an email with the temporary password – exactly the same process as when a user recovers a forgotten password.

History

The log of all changes to the repository will be recorded here, which includes all new users, deleted users, as well as the addition and removal of tables from the repository.