Advanced SAP HANA Modeling: How to Use the Query Results Caching Feature to Improve Performance
With each new release of SAP HANA come more features, many of which target improved performance of information views. This is especially true in caching. For a basic modeler who does not write code, SAP HANA Support Package Stack 9 brings many new options for how to best and more easily use query results. In this article, I show how to use this feature in the context of a view that uses smart data access and remote/virtual tables, as this is the primary use case. In addition, I discuss the performance advantages of using the daily and hourly query cache invalidation feature, as well as the drawbacks of these options when it comes to memory consumption and data latency (looking at old data). Finally, learn about a new option with Support Package Stack 11, the so-called transactional cache invalidation option, available with Support Package Stack 11 and newer. This is my preferred option, as it removes the latency drawback.
(Note:
The focus of this article is on the results cache. There is an even newer feature called the query view cache, available in Support Package Stack 11, but this requires code and is beyond the scope of this article.)
Definition, Background, and History of the Query Results Caching Feature
With the introduction of data warehousing systems came extract, transform, and load (ETL) tools that extracted data to the warehouse where it was stored physically in optimized structures for reporting, most often in a star schema. Given the state of reporting tools of the day, the reports on this summarized structured data were produced quickly. However, compared to the speed of today’s reporting systems, the evaluation of how fast a report is produced is measured in seconds, versus the minutes and hours this process took in the past. Even complex reports that connect many tables are now expected to run in seconds. SAP HANA itself, with all its memory, columnar data, and parallel processing, meets 99 percent of this need, but what about the +/– one percent of report processing that is still considered too slow? The new SAP HANA query results caching feature addresses this +/– one percent.
Explore related questions
Query results caching is a method for persistently storing the output of a SQL statement for reuse when the exact same SQL statement is generated again. This happens if, for example, the same view is accessed by the reporting tools with the same navigation and filtering. So, the first time you run a query (SQL statement), SAP HANA stores the output; this is called the cache of the results of the SQL statement. These results of a SQL statement normally persist just as long as you are using them, and are deleted when you execute another SQL statement. However, with this more advanced caching option set, you have the option to use this cache later (again), which can be quite handy.
The best scenario for caching SQL statements using modeling-based query results caching is when the following are true:
- Reusing the same query multiple times
- Using long-running queries on the database that access and join large tables
- Small result sets
In the case of number 3, there is a cost for using the results cache: The results cache uses memory and with SAP HANA, memory is not cheap. That said, if number 1 is a very significant, meaning your business runs the same query a lot and it is normally long running (#2), the caching function results in happier users and is worth the increase in memory consumption.
The basics of the query results caching feature were part of SAP HANA for a few years before the release of Support Package Stack 9. However, code was needed to implement it. With the release of Support Package Stack 9, a modeler-friendly option was added and, with Support Package Stack 11, still more choices came about. It is these view-specific options that are the main subject of this article, although there are still more caching options available. All these options are shown in Figure 1. (I only address the query results cache feature in this article; the other two options are remote cache and view cache, which are beyond the scope of this article as they currently require coding to implement.)
Setting Up the Query Results Cache Feature
Before discussing what is new in the query results caching feature, let me explain its basic setup. These basics are normally the domain of SAP HANA administrators as these configuration options are not set during the view-modeling process. As a modeler, you need to at least be aware of what needs to be done by others, so you’re not pulling your hair out when you follow directions and it still does not work.
To access the configuration settings, click the tools icon in the SAP HANA studio toolbar (not shown). The relevant ones in this example start with resultcache_, shown in Figure 2.
Again, as a modeler, your job is to work with your database administrator on these settings, and not set them yourself. One setting you should verify is the results cache enabled setting—make sure it set to yes as shown in Figure 2. You also need to check the default setting of the minimum execution time setting (resultcache_minimum_query_execution_time_in_milliseconds).
The default shown in Figure 2 is 100 milliseconds. This is the usual default setting, but in some situations, you might want to increase this number (in consultation with your database administrator) as the goal is to increase the speed of the long-running queries. The only time the system caches the results of a query is when the query runs longer than this value. So, you need to agree on the definition of a long-running query.
As mentioned before, the improved performance that comes with using the query results cache feature requires using a lot of memory. The maximum value-size setting (resultcache_value_size_in_bytes) causes the system to not cache results larger than this specified size, because your database administrator wants to help by allowing caching, but not if it takes so much memory it causes problems in other areas.
(Tip!
For a discussion of the other settings in Figure 2, refer to this blog post: https://scn.sap.com/thread/3459635. Included in this blog post is one additional configuration entry, the so-called whitelist setting that must be manually added to limit caching to specific objects. Although this creates a significant maintenance burden, it allows you to limit which views generate the cache, thus affecting memory consumption, but since it is not done that frequently, refer to the blog for more information on exactly how this is done.)
Query results caching stores the cached results for a specific query. This is not the case with SQL plan caching, in which the SQL execution plan is saved for subsequent reuse. Here we are talking about the data, not the SQL statements. So. what happens when the data changes?
The Query Results Caching Feature’s Options and How They Work in SAP HANA Support Package Stack 9
In the current version of SAP HANA, there are different options relating to the query results cache. Go to the View Properties tab by accessing the view and choosing the semantics node and you see the two information view options, Cache (on the left) and Cache Invalidation Period (on the right), as shown in Figure 3. Before Support Package Stack 9 these options did not exist and, with Support Package Stack 11 and newer, the Transactional option for the Cache Invalidation Period appears as a default, clarifying how the system behaves when the Cache check box is enabled.
When the Cache check box is checked and the Transactional option is invoked by default from the Cache Invalidation Period drop-down, the results of a query are persisted in memory (cached) until changes in the underlying tables of the view are made. With this option, your view’s data is kept up to date, but the cache needs to be rebuilt when a query is made after a table update. If changes happen to the tables often throughout the day, the impact of this increase in speed is defeated. When one of the other options (hourly or daily) for Cache Invalidation Period is chosen, instead of always invalidating the cache based on a data change and forcing the system to read the tables again, the tables are only re-read when the data in the cache is outdated, as follows.
There are two options for re-reading data, Hourly or Daily. If you chose Hourly, your query reads the underlying tables for new data unless the prior query execution cache was created less than one hour ago. If the cache is newer than one hour, then the system does not access the real tables (and does not get the most current data) but rather uses the results it cached under an hour ago. This means new data is not read unless it is in the database for more than one hour. If you set this option to Daily, it uses the same concept but with longer latency.
There are two scenarios when doing this is a good idea:
- The underlying data changes, but all users should only see the snapshot that happened on the first execution for the day so that everyone is on the same page in discussions about the data
- The data changes but it is OK for users to see the data that is old because they place a greater value on speed than on accuracy
Demonstrating the Query Results Cache
In my scenario, I wanted to add a little twist, so the calculation view that I use to cache the query results uses, as one component, a virtual table (KNA1) that resides on an Oracle database and not on SAP HANA. This adds value because in most cases reading data from an external source is one of the slowest things you can do, as the speed is based on the other system, not SAP HANA. In caching the results of the view, I am not only saving time in joining and aggregating (Figure 2), but also with the reading of the virtual table. My goal in this article is not to teach modeling—my focus is on caching. That said, I do show you how to create the virtual table, as this is not a well-known feature. I should also note that the remote cache of just the remote table (as shown in Figure 1) is also a good choice in my scenario.
Figure 4 shows the configuration of the remote connection to Oracle. Obviously, this setup, like everything in SAP HANA, is controlled via privileges.
Once a remote source is established, expand the remote source table folder as shown in Figure 5. Then, using a right-click to open the context menu, choose Add as Virtual Table.
The dialog box in Figure 6 opens where you can assign a name and a folder (schema) where the virtual table should appear.
Once a virtual table is created, it is consumed in a view (in my case a calculation view) the same way any other table would be. The only difference is the icon for the table. For virtual tables the table icon has a green-colored component, as shown in Figure 7.
The next step in my scenario is to build a simple calculation view joining the normal (in column-based) table of orders with the virtual table of customers. I am not showing how this is done using a virtual table versus a physical one, because it is no different. (If you do not know how to build a basic view you can attend the SAP Training class HA100.)
Once I create my view and set the Cache Invalidation Period setting (as shown in Figure 3), the next step is to test and retest it. This is because, as I mentioned in the beginning of this article, the underlying caching technology is in flux, and you need to make sure your SQL caches correctly.
To verify that a cache is used, first you need to make sure it’s created. In this scenario, using the context menu of the created view, choose Data preview… . This, under the covers, creates and runs a SQL statement to view all the fields, but because of a default setting, it also filters to just get 5,000 records. Then, to verify that this specific SQL was cached, I preview a special view called M_CACHE_ENTRIES view under the SYS schema (Figure 8).
The top entry in Figure 8 tells you that the system cached what was previewed from the simple view (SELECT TOP 5000 “KUNNER”,…). The READ_COUNT of 2 shows that this cache has been accessed two times so far. Note that the first time I ran this preview, the entry was created and 0 was in the READ_COUNT field. These entries are not shown but rather the counter is just updated with each access of the exact SQL. Then I ran this preview SQL two more times so the READ_COUNT shows 2. This is how you can verify that the cache was used.
It is important to realize that the other entries in the M_CACHE_ENTRIES view in Figure 8 were also executed against the same calculation view, but not against the exact same SQL. Only the exact same SQL uses the query results cache.
Now let’s see what happens if you wait for over an hour before running the preview (which executes the same SQL again), to see what if the cache counter is updated to 3 as you might expect (but would be wrong). In this case, I waited three hours and 25 minutes, to be exact. Figure 9 shows another entry for Mar 27, 2015 8:20:41. This entry was initially READ_COUNT 0 (not shown), because I previewed the data using the same SQL, but it was after the one-hour invalidation setting kicked in so another entry was created. Then I ran it once more, resulting in a read count that equals 1, shown in Figure 9.