First Steps With Talend

Introduction

Talend in an open source tool data integration tool (also known as ETL or Extract Translate Load). It provides a host of features which greatly simplify the job of moving data from one source to another, whether the data is stored in database, spreadsheets or XML files (etc). It is perfectly placed to help with the job of extracting data from a company's internal data stores, translating it into a form which can be used within OpenActuary, and loading it into an OpenActuary staging area.

This tutorial will take you through process of installing talend, and gets you up and running with a project that can access the Metropolitan Insurance staging database which is included in a default OpenActuary installation. We'll also include some pointers to other useful tutorials and online resources to help you get started with using Talend in anger.

The tutorial assumes that you have no previous experience of Talend.

Installing Talend

Talend is available in a number of flavours to support different kinds of ETL related activities. The one we're interested in is "Talend Open Studio for Data Integration". Talend runs within the Eclipse IDE framework but it cannot be installed as an Eclipse plugin. However, downloads are provided for Windows, MacOS and Linux.

You can download it here. Version 5.3.1 or later is recommended. Once you've downloaded the .zip file, unpack it into a convenient folder on your machine.

In Talend's installation folder you'll find a variety of ways to start the tool depending on your operating system. Which ever of those you choose to run, you will ultimately arrive at Talend's startup screen:

Creating a Project

The first thing we will need is a project to work in. Projects in Talend, in common with other Eclipse based tools, are arranged into workspaces. From this screen you can change the location of the folder where workspaces are stored, but for now just accept the default on this, and simply click the "Create..." button.

Enter a project name of "ARDR Motor" and click "Finish". You'll be returned to the start screen with the new project pre-selected. So you can simply click "Open" to open the project.

Now you get the option to connect to the online community. You'll get this option every time you open a project if you're not already connected. You might want to connect later, but for now just hit "Skip".

Talend will now load, and after running some project setup steps it will present you with the welcome screen. There are some useful links here to tutorials and online help that you might want to explore later, but we want to get into the tool itself right away so click "Start Now!".

Talend now opens showing your new project. If you're familiar with Eclipse already, you'll feel at home right away. If not, you essentially have a panel top-left which displays the content of your project. The "Repository" view which you see there now gives you a structured overview of the components of your project. The "Navigator" view lets you interact with the files directly. We'll still with the "Repository" view for now.

The centre of the screen which currently shows the welcome message is the work area where we'll be designing the ETL job. The bottom of the screen shows status and property editors, and the right hand side will display a palette of ETL modules once we get going.

Right click on "Job Designs" in the "Repository" view, then select "Create Job" from the context menu.

Fill in the Name field as "UploadToMotorStaging" and leave the other fields as they are and click "Finish". As the message says, it is advisable to fill in the Purpose and Description fields and it is probably good practice too,  but Talend renders these in tooltips over menu items when you're working on a project. So they actually get in the way and are better off left blank.

Your new Job now opens in the job designer, and the panel on the right of the display is populated with ETL modules that you can add to your design.

Creating a Database Connection

Before we get into designing the job itself, we need to configure a database connection and import the table structure into Talend. To do this, start by adding a new connection. Click on the "Create" icon in the icon bar

And then select "Create connection" from the menu:

You'll be prompted to select a folder for the new connection, just click "OK" on this dialog to open the "New Database Connection" window.

Enter the new "MotorStagingDatabase", and click "Next >".

In order to fill this form in we'll need to login to OpenActuary's web interface as the user who will be running this ETL job and fetch the database details form the data source view. In this case we're going to be running the job as the user "Lily" who works for Metropolitan Insurance. Login in as her, and going to the "Sources" tab for staging database gives us the following screen detailing the database connection to use.

Copying these details into Talend's "New Database Connection" window results in the following:

Click the "Check" button to test the settings and ensure that Talend can see the database. This should report a successful connection. If it doesn't then check that you have the database settings exactly as shown on OpenActuary's Sources tab, and also make sure that you have MySQL running.

Now click "Finish" to save the database settings into the project.

Importing a Database Schema

Next we need to import the databases' schema into the project. To do this, right click on  "MotorStagingDatabase 1.0" in the Repository view under "Metadata/Db Connections", and select "Retrieve Schema" from the context menu. Note: if you don't get a context menu then save your project and restart Talend. After the restart the context menu will work correctly.

On the first screen of the import wizard, accept the defaults and click "Next >"

On the second screen of the import wizard, click on the icon next to "MetropolitanInsurance_Motor_Staging to open up the tree listing the tables. Then click "Select All" to import all of the tables into the project, then click "Next >"

and on the last screen of the wizard simply click "Finish". You have the option on this last screen to add or remove elements from list that will be imported, but in this case we will simply import them all.

Creating the ETL Job

Now that the database is setup and the schema imported, we can start to design the job itself.

We want to sent the output of the new job into OpenActuary's motor staging database. So, click on "Databases" in the palette list, then scroll down until you find MySQL. Click on that to open it, and you'll get a list of MySQL specific modules.

We want to send our output to MySQL, so look for the module named "tMysqlOutput". Drag that module onto the job design.

Double clicking the database icon on the newly added module will open the component details view at the bottom of the window. Here we can configure the database that this component will output to.

In the component view, set "Property Type" to "Repository" and click the ellipsis that appears over to it's right. From the resulting dialog, select the "MotorStagingDatabase 1.0", and click OK. The database connection details will now be filled in based on the connection we defined earlier.

Again in the component view, click the ellipsis next to the "Table" text area. Expand the table list in the resulting dialog and select "polPolicy". The root element in all OpenActuary databases is the policy table. We will add more tables to the job later, but this is the starting point.

In the component view again, set "Schema" to "Repository". Talend will automatically select the schema we imported earlier.

The component view should now look like this:

The database connection is now ready to be used from the ETL Job. The only thing left to do now, is to find some data to load into it!

22 Attachments
1117 Views
Average (0 Votes)
Comments
No comments yet. Be the first.