16 Sep 2019 Talend Overview: Collect, Govern, Transform and Share your data with Talend
Talend has been on the market since 2006, offering its open source data integration tool (Talend Open Studio), and in 2009 it entered the Gartner Magic Quadrant for Data Integration Tools. In early 2010 it started introducing data quality products, which earned it a place in the Magic Quadrant for Data Quality Tools in 2011.
This year Talend has been named a Leader in both the Magic Quadrant for Data Quality and, for the fourth year in a row, the Magic Quadrant for Data Integration!
Figure 1: Magic Quadrant for Data Quality Tools. Source: Gartner (March 2019)
Figure 2: Magic Quadrant for Data Integration Tools. Source: Gartner (July 2019)
Talend comes in two versions. Talend Open Studio (TOS) is a collection of open source tools, each with a specific purpose, that help you extract, transform, prepare and load your data, and much more. Talend Enterprise is a single unified, subscription-based platform with more capabilities and specific improvements that we will go through later in the article. It is also worth mentioning that Talend has Cloud offerings.
First, we are going to list the different tools that Talend provides with the Open Studio version. These products are separate and independent of each other. After that, we will review Talend Enterprise and list the advantages it brings compared to the open source products. Finally, we will briefly explain the possibilities that Talend Cloud brings.
1. Talend Open Studio
Talend offers a set of open-source tools that can be used for free. We will now briefly explain what each tool can do:
- Open Studio for Data Integration: ETL/ELT tool with graphical UI to develop data integration pipelines. It has connectors to most known databases, systems and technologies. It allows file management, as well as execution and orchestration of data flows in which one can do transformation, aggregation, enrichment of data and much more.
- Open Studio for Big Data: On top of all the ETL capabilities of the Data Integration tool, this tool includes specific components that ease the interaction with Big Data tools and ecosystems, such as YARN scheduling, Hadoop Security for Kerberos, data interaction with data lakes, connectors to Cloud services, Hadoop components (HDFS, HBase, Hive, Sqoop…), etc.
- Open Studio for ESB (Enterprise Service Bus): Includes all the capabilities of the Data Integration tool, and adds some REST Server components that allow us to interact with REST APIs, WSDL, OAuth and more. The tool also works with HTTP, JMS, UDP, Apache Kafka and many other protocols, and includes command line and scripting tools in addition to the drag-and-drop visual interface.
- Open Studio for Data Quality: Data Quality is a tool that will help with data profiling and analytics, with graphical charts and drilldown data. It includes some advanced data profiling such as fraud pattern detection, column set analysis, advanced matching analysis, time column correlation analysis, etc.
- Talend MDM (Master Data Management): This tool combines the power of MDM specific components with Data Integration in order to deliver a single version of the data across the rest of resources, both internal and external. The MDM platform also includes data quality components built in to ensure clean, useable and accessible data.
- (Web) Data Preparation Free Desktop: This web tool provides an easy-to-use visual interface to automate your data cleaning. It enables you to develop reusable data preparation transformations very quickly.
- (Web) Stitch Data Loader: This tool is also web-based, and its objective is to load data from several cloud origins into cloud data warehouses and cloud data lakes in minutes. It is free to use up to a certain number of millions of rows per month.
As mentioned before, all these tools are isolated from each other. Each one of them is a standalone tool that is not integrated in any way with the rest.
2. Talend Data Fabric
Besides the Open Source tools, Talend also provides the Enterprise edition, called Talend Data Fabric (TDF), a complete solution for any of your data needs. It is a single platform that essentially combines all the open-source components and much more:
- Easy connectivity, with more than 900 different connectors and components.
- Manages data across all kinds of environments, both cloud and on-premises.
- Supports batch loading, real-time and streaming loading and big data use cases.
- Includes built-in machine learning and data quality capabilities.
- The pricing model is predictable and user-based, regardless of the amount of data moved or the number of jobs executed.
This image shows the tool ecosystem inside TDF.
The tools inside Talend Data Fabric are integrated with each other, and they offer some new capabilities that the Open Source version does not have, aiding in a faster and better development of the code.
In the following section we will describe the key differences between Talend Data Fabric and Open Source.
3. Code reusability
To help reuse code without duplicating it, Talend Data Fabric offers a new component called Joblet. A Joblet is a way of encapsulating recurrent processing steps or complex transformations, either to reuse the same code in several places or to make a complex job more readable.
You can use a Joblet in different jobs and/or use it several times on the same job, but the code is written just once and shared across all these jobs, which is a lot easier to maintain and modify.
When the job is running, the Joblet code is integrated into the main job code. It is not a new Java class that runs separately: it is part of the main job and keeps the same execution context and variables, instead of creating a new separate execution as would happen with sub-jobs.
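The difference between sharing the context and running separately can be sketched in plain Java. This is only an analogy, not the code Talend generates: the class, the method names and the `stepsRun` key are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class JobletSketch {

    // Hypothetical reusable step, standing in for a Joblet: it works directly
    // on the context of whichever job calls it.
    public static void auditStep(Map<String, Integer> context) {
        context.merge("stepsRun", 1, Integer::sum);
    }

    // Hypothetical separate execution: it receives a copy of the context,
    // so its changes are not visible to the caller.
    public static Map<String, Integer> runSeparately(Map<String, Integer> parentContext) {
        Map<String, Integer> ownContext = new HashMap<>(parentContext);
        auditStep(ownContext);
        return ownContext;
    }

    public static void main(String[] args) {
        Map<String, Integer> context = new HashMap<>();
        auditStep(context);      // the same step reused twice in one job
        auditStep(context);
        runSeparately(context);  // does not touch the caller's context
        System.out.println("stepsRun in the main job: " + context.get("stepsRun"));
    }
}
```

The step is written once and called from several places, and only the calls made inside the main job change its context.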
4. Code parallelization
Code parallelization in Talend Enterprise is far easier than in Open Studio. Instead of multi-threading a job, you can use the ‘tParallelize’ component to control all the executions that you want to carry out in parallel and control the synchronization (which is not possible in TOS) for when the parallel executions have finished. Here are some examples of these cases:
4.1. Multi-threaded execution
While using Talend Open Studio, in order to run multiple subjobs (with no dependencies between them) in parallel, we must activate the ‘Multi thread execution’ feature. Once the checkbox is activated, the subjobs will run in parallel.
If the machine does not have enough processing power to run all of them, some subjobs may be queued, waiting for the resources to be available.
With this option we have no way of synchronizing the execution of these branches and executing another component once all the subjobs have finished.
4.2. tParallelize component
To manage a more complex job-subjob system, Talend Enterprise adds the tParallelize component, which can be placed at any position in the job and controls both the parallel execution of the different branches and the synchronization of the execution once they have finished.
The ‘Parallelize’ links between components will define the different branches to execute in parallel from this component onward:
This component also has a ‘Synchronize (Wait for all)’ link, which waits for all the ‘Parallelize’ branches to finish before continuing this branch of the execution.
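The "run branches in parallel, then wait for all" pattern can be illustrated with standard Java concurrency. This is a minimal sketch of the idea, not what tParallelize generates; the class name, the method name and the branch names are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelizeSketch {

    // Runs every branch in parallel, waits for all of them to finish and only
    // then runs the follow-up step. Returns the log of completed steps.
    public static List<String> runBranchesThenSync(List<String> branchNames) {
        List<String> log = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, branchNames.size()));
        try {
            // One asynchronous task per branch (the 'Parallelize' links).
            CompletableFuture<?>[] branches = branchNames.stream()
                .map(name -> CompletableFuture.runAsync(() -> log.add(name), pool))
                .toArray(CompletableFuture[]::new);

            // Block until every branch has finished (the 'Synchronize' link).
            CompletableFuture.allOf(branches).join();
            log.add("synchronized");
        } finally {
            pool.shutdown();
        }
        return log;
    }

    public static void main(String[] args) {
        // The branch order may vary between runs; "synchronized" is always last.
        System.out.println(runBranchesThenSync(Arrays.asList("A", "B", "C", "D")));
    }
}
```

Without the `allOf(...).join()` call, the follow-up step could run before some branches had finished, which is the limitation of plain multi-threaded execution described in section 4.1.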
5. Dynamic schema
In Talend Open Studio, reading a file requires a Read File component in which you must define the structure of the file to read. If the file changes, or you want to read different types of files, you need to modify this component or create one for each type. Talend Enterprise includes the Dynamic Schema concept, which allows us to read multiple files with different structures using only one Read File component, so there is no need to create a component for each type of file.
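To illustrate the idea of a schema discovered at run time, here is a minimal sketch in plain Java. It is not Talend's implementation: the class and method names are hypothetical, and it assumes simple delimited text with a header line and no quoted fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class DynamicSchemaSketch {

    // Parses delimited text whose first line is a header. Column names are
    // discovered at run time instead of being declared in advance.
    public static List<Map<String, String>> readRows(String content, String delimiter) {
        List<Map<String, String>> rows = new ArrayList<>();
        String[] lines = content.split("\\R");
        if (lines.length == 0 || lines[0].isEmpty()) {
            return rows;
        }
        String separator = Pattern.quote(delimiter);
        String[] columns = lines[0].split(separator, -1);
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                continue; // skip blank lines
            }
            String[] values = lines[i].split(separator, -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < columns.length; c++) {
                // Missing trailing values are stored as empty strings.
                row.put(columns[c], c < values.length ? values[c] : "");
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two inputs with different structures, read by the same method.
        System.out.println(readRows("id;name\n1;Ada\n2;Linus", ";"));
        System.out.println(readRows("order_id;customer_id;amount\n10;1;99.50", ";"));
    }
}
```

The same reader handles both inputs because each row is a map keyed by whatever columns the header declares, rather than a fixed structure.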
6. Version control
When it comes to versioning the code, resolving file conflicts in Talend Open Studio can be a bit of a nightmare, since it does not provide any built-in version control. It is up to you to manage these problems externally.
Talend Enterprise edition can be integrated with a remote repository (Git/SVN), making the version control of the generated code much easier.
7. Remote execution
While in Talend Open Studio all the executions run locally, with Talend Enterprise you are able to set up the execution of the code to be done on remote servers instead of your own local machine if you need it. You do not even need to move your code between different machines.
Basically, thanks to this we are able to run our code on as many servers as we want from our own local Studio without needing to deploy the code anywhere.
8. Management and monitoring
Talend Enterprise includes TAC (Talend Administration Centre), running on a server. TAC is a web-based application that helps with the administration of the Talend Studio projects, users and access to remote repositories, amongst several other capabilities.
With Talend Administration Centre you can centralize the users’ role management and access rights to the different projects and schedule and monitor the jobs that you want. You can even define the access that the users have to your data.
All users created via TAC will be able to connect to the projects they have been assigned to in the studio, where they can create the processes that will be launched remotely, scheduled and monitored by TAC.
Basically, Talend Administration Centre allows you to:
- Manage Users, Access, Projects and Job Execution
- Schedule Jobs
- Monitor job execution
- Logging (like data flow and changes)
9. Testing
While working with the Enterprise Edition, you can create “Test cases” to help you debug your code and develop faster. In TOS, by contrast, you can only execute full jobs, so you have to manually isolate parts of the code and provide input/output for them in order to test them individually.
Meanwhile, if you are using Talend Enterprise you can simply select the components you want to test from the canvas. Those components will be the only ones to be executed in the “Test Case”.
Then, you can provide specific input and output for those components to be executed and tested.
This way you don’t have to manually modify the code in order to test some parts of the jobs: just select the parts you want to test and provide the input and/or output they need, which will save you a lot of time during development.
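The principle behind a test case, running an isolated step against fixed input and comparing the result with the expected output, can be sketched in plain Java. This is not how Talend implements test cases; the transformation and all the names below are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class TestCaseSketch {

    // Hypothetical transformation step taken out of a larger job:
    // trims names, drops blank entries and upper-cases the rest.
    public static List<String> normalizeNames(List<String> input) {
        return input.stream()
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .map(s -> s.toUpperCase(Locale.ROOT))
            .collect(Collectors.toList());
    }

    // Runs the isolated step against fixed input and compares the result
    // with the expected output.
    public static boolean runTestCase(List<String> input, List<String> expected) {
        return normalizeNames(input).equals(expected);
    }

    public static void main(String[] args) {
        boolean passed = runTestCase(
            Arrays.asList(" ada ", "", "Linus"),
            Arrays.asList("ADA", "LINUS"));
        System.out.println(passed ? "test case passed" : "test case failed");
    }
}
```

Only the selected step runs, so the rest of the job (sources, targets, other transformations) does not need to be available or modified.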
10. Documentation
Talend Enterprise allows you to generate documentation of your jobs, instead of taking screenshots of the boxes on the canvas and pasting them into a text editor. You can generate this documentation with the tool and create your own document out of it.
11. Support
The final difference is, obviously, the support that Talend offers to those using the licensed version. It is available 24/7, so you can ask for help at any moment if you are facing an issue.
If you are using Open Studio, you will have to rely on the community and ask for help in the Talend Forum. This resource is not bad, since there are a lot of developers out there using the Open Studio version, but if you are facing deadlines and problems that need to be solved fast, you may need more specialised support.
12. Cloud
Talend also has Cloud offerings: Talend Cloud is an integration platform-as-a-service (iPaaS). It is built around the Talend Management Console, a platform that runs on the private Talend Cloud and essentially acts as the TAC that we described above. You then need to configure remote engines, which are the ones that actually run the jobs. Remote engines can run in the private Talend Cloud or, for easier access to source and target systems, on public clouds such as AWS or Azure, on-premises or on private clouds. This greatly reduces the burden of managing Talend platforms, so Talend users only need to take care of developing the pipelines.
Moreover, Talend Cloud can be used to create and expose Cloud API Services to interact with the platform and with the data.
Conclusion
In this blog article we have described the main capabilities that Talend offers, and we have discussed the most important differences between the Enterprise edition and the Open Source products.
In summary, the Talend Enterprise edition helps the developer with code reusability, remote repository features, the Talend Administration Centre to manage users and schedule jobs, improved testing capabilities and documentation generation. The Enterprise edition also comes with support from Talend. Talend Cloud can host the administration of a Talend platform as well as its execution engines, though these can also run in public and private clouds as well as on-premises.
On the other hand, Talend Open Studio offers a set of specialized tools that will help you achieve most of your needs regarding ETL, Data Quality and Big Data. The community is strong and alive, since there are a lot of developers out there using it.
So, Talend Enterprise offers some very nice perks that really improve both administration and development, reducing the risk factor in your strategic and critical processes. But that does not invalidate Talend Open Studio: it is up to you to decide which tools are better suited to your needs and how you will use them.
If you would like to know more about any of the topics raised in this article, please do not hesitate to contact us here at ClearPeaks – we will be glad to help you!