15 Mar 2023 Power BI & Snowflake Part 2: Hybrid Tables in Power BI with Snowflake
In the first part of our series, we talked about how to connect Power BI to Snowflake, and looked at some of the cool, useful features this can offer us.
One of these features is the possibility to create hybrid tables, and in this article we will explain what these tables are and how we can create them; they’ll allow us to visualise real-time data with the minimum use of resources.
First we’ve got to run through how to use incremental refresh, a great tool in its own right, but also the first step in creating hybrid tables. Let’s take a look!
Incremental Refresh
Incremental refresh is a way to update data efficiently in Power BI. It extends scheduled refresh operations by providing automated partition and management for dataset tables that frequently load new and updated data.
An incremental refresh is faster, more reliable, and consumes fewer resources. It will partition our dataset, dividing it into data that won’t be included in our report or “removed data”, data that will remain unrefreshed in our report or “archived data”, and data that will be refreshed in our report or “incrementally refreshed data”.
Note that a refresh will remove some data because it will no longer be in the “archived data” period; there will also be some data that will change from “incrementally refreshed data” to “archived data”, as well as some new data in “incrementally refreshed data”.
To set up an incremental refresh, first we have to import all the tables with the data that we need to Power BI Desktop. After loading, we select Transform Data on the Home ribbon. Once in the Power Query Editor, we select Manage Parameters in the Home ribbon, and here we create two new parameters called RangeStart and RangeEnd (case sensitive). In the Type tab we select Date/Time and give them both a date in Current Value; it doesn’t matter what date we put in Current Value because the parameters will change once we do the incremental refresh.
Then we make sure that the date field we are going to use for the incremental refresh is set to Date/Time type. For this field we select Date/Time filters > Between….and a new window called Filter Rows will appear. Here we must select is after or equal to, Parameters and RangeStart; then we select is before, Parameters and RangeEnd. After that, we have to click on OK. It’s important that we select is before and not is before or equal to, otherwise the incremental refresh won’t work because a row could satisfy the conditions for two partitions, which could lead to duplicate data in the model. Finally, we select Close & Apply.
Next, we must right-click on the table in the Fields ribbon and select Incremental Refresh. A new window will pop up, where in Select Table, we choose the table that we are going to use for the incremental refresh. We also must enable Incremental Refresh and in the Archive data starting _ _ before refresh date option we must set the period for which the data is to be imported for our report; this will define the “archived data”. In the Incrementally refresh data starting _ _ before refresh date option, we set the period for which the data will be refreshed; this will define the “incrementally refreshed data”. Finally, we select Apply and now the incremental refresh is good to go!
In the image above, there is an example where we have set “Archive data starting 5 years before refresh date” and “Incrementally refresh data starting 30 days before refresh date”. This means that even if we have data that’s more than 5 years old, only data from the last 5 years will be imported to our report, and when we refresh our report, only the data from the last 30 days will be refreshed; data from 5 years ago to 30 days ago will remain unrefreshed.
After we have done all this, we can publish our report; the incremental refresh will only work with Power BI Service. Once published, we must go to the workspace where we uploaded it and refresh the corresponding dataset. This first refresh will take longer, because all the data has to be imported first in order to make the partition later. After this, other refreshes will be faster, and we can set a scheduled refresh if we want.
Hybrid Tables
Finally, let’s look at what hybrid tables are and how to use them. Hybrid tables are visualisations that combine changing real-time data with static old data; in other words, they combine data obtained with DirectQuery and Imported data. That’s why hybrid tables need a database that allows us to store data with Import and DirectQuery modes, like Snowflake. It’s important to add that this feature is only available with Power BI Premium.
Hybrid tables are a modification of incremental refresh where there’s another type of data: “real-time data”. This means that when we refresh, there will be new “real-time data”; old “real-time data” will become “incrementally refreshed data”, part of the old “incrementally refreshed data” will become “archived data” and part of the old “archived data” will become “removed data”.
To see how to configure a hybrid table, let’s look at the following example:
Suppose that a delivery company has a dataset for each shipment with:
- The ID number of the shipment.
- The shipping date.
- The state of the shipment.
The state of the shipment can be:
- Delivered: If the shipment is completed.
- Missed: If the shipment didn’t arrive to its destination.
- In progress: If the shipment is being delivered.
- Not sent yet: If the shipment hasn’t been sent yet.
There are lots of different shipments every day and the company has information for the last 12 years. This company only operates locally, so if a shipment hasn’t arrived in 30 days we can assume it never will, and it becomes a “missed” shipment. We also know that shipments will be sent on the day they are ordered. The company wants to know how many shipments of each type it has handled over the last 5 years.
This is a perfect example of when to use hybrid tables. We have a large dataset, so it wouldn’t be very practical to import it all and periodically refresh it. Instead, we will import only the last 5 years, refresh only the last 30 days, and use a DirectQuery to get the real-time data of the present day. To do all this, we need to set up an incremental refresh beforehand.
The first step is to connect to Snowflake and create a DirectQuery to get the table with our data. If we are working with a star schema, we will do this with our fact table.
Then we go to Model and, on the Properties ribbon for our table, select Advanced Options and change it to DirectQuery. We won’t import the fact table now, because then we wouldn’t be able to create a hybrid table later.
As for the dimension tables, we’ll connect to Snowflake with DirectQuery and then set them as Dual. This way, the dimension tables will act as DirectQuery for the real-time data and as the Import mode for the rest.
Once we have done that, we can create an incremental refresh as we explained in the previous section, using the shipment date field as the field to apply the filter to.
Now, when we open the Incremental refresh and real-time data window, as well as all we’ve explained for the incremental refresh, we have to go to Choose optional settings and select the option Get the latest data in real time with DirectQuery, then click on Apply. Now we will get the data for the last value we chose for the incremental refresh. This means that if we choose to get the last 30 days as an incremental refresh, we will have real-time data for the last 30 days, but if we choose to get last month as an incremental refresh, we will have real-time data for last month.
Now we can publish the report (as we are using a Premium feature, we will have to publish it in a Premium workspace); like incremental refresh, hybrid tables will only work in Power BI Service. Then refresh the dataset, and the hybrid table will be ready to work.
For real-time data, it’s important to set a Page refresh that will automatically refresh our report page so we can actually see data changes in near real-time. This option is only available in the DirectQuery mode or the Dual mode. But incremental refresh, and also hybrid tables, are an option only available for imported data. If we are working with a schema where we have DirectQuery or Dual tables, besides our Imported table, we can set this option in Power BI Desktop. If this isn’t our case we can set this option in Power BI Server.
In our example, we have a unique table that is in Import mode. So, if we want to set up the Page refresh, we must go to the report and select Edit > Visualizations > Format your report page > Page refresh. Here we enable the Page refresh option and set the refresh time; in our example we’ll set it to every 20 seconds.
Now that we’ve done all that, if we add some new shipments to our data source in Snowflake, we will see how the report automatically changes.
Finally, to see how powerful this tool really is, we have created two reports for comparison purposes. Both reports use a 100MB data warehouse from Snowflake, and both have only a simple table. The difference is that while in the first one we have imported the data, in the second one we have created a hybrid table. If we add some data to the warehouse (now it has a size of 102.1MB) and refresh both reports, the one that has the hybrid table will take less than 10 seconds to refresh, while the other will take more than 5 minutes. What a difference!
This is hardly surprising when we consider what we’ve explained in this article. On the one hand, the report with the imported data will have to import all the data again in order to refresh it, whilst on the other hand the report with the hybrid table will only import the newest data in order to refresh it. The bigger the dataset, the bigger the time difference between the refresh processes! That’s why hybrid tables are a great way to optimise the resource consumption of our reports.
Conclusion
That’s the end of the second and last article in this mini-series. We have explained a way to leverage Snowflake in Power BI using hybrid tables, an option well worth bearing in mind offering real-time data using minimum resources.
And remember that this wouldn’t be possible without Snowflake. As we have seen, to create a hybrid table we need to store data as imported data as well as a DirectQuery. There aren’t many platforms that support DirectQuery natively, and Snowflake is one of them.
We will keep on exploring other Snowflake applications in our blog, and if you are interested in knowing more about Snowflake or Power BI and their potential, don’t hesitate to contact us – our certified expert consultants will be happy to help!