20 Jun 2013 Hive vs. Impala with Tableau
As I explained in a previous post, Cloudera is an active contributor to the Hadoop Project and in this ecosystem they have launched Impala inside the CDH4 package.
Impala offers the possibility of running native queries in Apache Hadoop. It allows users to communicate with the data stored in HDFS using live SQL queries with no data extractions neither additional transformation. But for those of you that already knew Hive, this is nothing new right? Therefore, which is the difference? In the official web of Cloudera, Impala is introduced as the truthful solution for real time data analysis, as it offers a fast implementation process using SQL through Business Intelligence tools. In fact, Impala is not 100% a substitute for Hive (Impala does not cover batch process and ETL, which are offered by Hive) but it is the option that offers shorter execution time in SQL queries as well as better integration with leader tools in BI. I decided to make a test and compare Hive and Impala in the same environment, and for that I used Tableau. Remember that Tableau allows using any or both connections in its 8th version.
In what follows, I analyze the same data source and I launch the same SQL query with the same BI tool. In the web of Cloudera there is the option to download a VM with everything you need for a quick start with Hadoop (https://ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM).
This VM has Hive and Impala installed, so it is perfect for the comparison that I want to build in. Inside the VM, the initial appearance inside Cloudera Manager (the CDH4 app that shows all active services) is the following:
We can see that among the installed services we can find the main two Hadoop members: Mapreduce (mapreduce1) and HDFS (hdfs1). We also see that hive1 and impala1 are already started (it is necessary to start Impala services manually, as it is off by default).
Inside the HUE application (a query editor for Hive, Pig, and Impala that has a file explorer for HDFS) there are two available samples to download, with data on workers and wages. We will use these two tables as data sources for the comparison: sample_07 and sample_08. Thanks to this virtual machine it is possible to save plenty of time when installing the components one after the other. With all necessary services running alright, the next step is to perform the connection with Tableau.
Cloudera Hadoop connection in Tableau
Before starting, it is necessary to install the Cloudera Hadoop driver for Tableau. This is the link to download the driver: https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Tableau+Download
- Once downloaded, follow instructions to setup the driver.
2. Once installed, open Tableau and create one connection to Cloudera Hadoop for each database.
3. First I start with the Hive connection. The specific port for this connection is the 10000. Select the default scheme and the sample_08 single table.
4. As I mentioned before, I will create the same report for both connections in order to evaluate the execution time of the same query. The report is the following one:
5. Then, we carry out a second connection, this time with Impala. The specific port is the 21000.
6. I perform the same report showed above and compare the execution time of the queries:
As we can see, the query is the same, but the execution time is very different: with Impala it is almost 1 second, and using Hive connection it is more than 1 minute! It is a huge performance difference We have checked, therefore, that on equal ground, Impala is the best option in terms of performance. Impala by-passes the Map-Reduce layer in Hadoop resulting in much faster query response times than Hive. It’s not risky to affirm that most customers wanting to do ad-hoc visual analytics on Hadoop will turn to a technology like Impala.