Query caching for performant projects

Use query caching to optimize time and database spend.

tip

This tutorial is available as a published Hex App here.

SQL query caching

ETL schedules are sometimes different for different tables, which leads to opportunities for caching optimizations.

Say we have a very expensive query against table1 and table2, which are derived tables updated only once a day, at midnight. However, we want to join table3 into the mix, and table3 is updated in near real-time: people looking at the output expect their reports to stay consistently up to date. We obviously can't cache the entire query on a daily cadence, but we also don't want to run a slow and costly query every single time someone loads the app.
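
For illustration, here's a rough sketch of the kind of combined query we'd like to avoid running on every app load. The table and column names (table1, table2, table3, id, value) are hypothetical stand-ins, not a real schema:

-- Hypothetical combined query: the joins against table1 and table2 are
-- expensive but their data only changes once a day, while table3
-- changes in near real-time.
SELECT
  t1.id,
  t1.value AS daily_value,
  t2.value AS other_daily_value,
  t3.value AS realtime_value
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id
JOIN table3 t3 ON t1.id = t3.id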

This tutorial will walk you through different ways to configure cache settings for your queries and how to use Dataframe SQL to optimize your Hex app with fine-grained control over query caching.

Configure query cache settings

By default, caching is disabled so your query will execute with every run of your project. Enable caching in the top right of your SQL cell and you’ll be able to select different configuration options:

  • On initial app load: The query will only execute and update the cache when a user first loads (or refreshes) a published app. Any other run of the project will use the cached values of the query.
  • On scheduled run: The query will only execute and update the cache on a scheduled run of the project. Any other run of the project will use the cached values of the query. This caching option is only available once you have configured your project to run on a schedule, as described here.

Check out the documentation on caching options here for an example.

Use Dataframe SQL to optimize

There is a way to get even more clever! Using a chained SQL approach, we can split a query into two separate SQL queries that are cached independently, then join their results back together using Dataframe SQL. Dataframe SQL lets us query any dataframe in a Hex project, including the results of previous SQL queries.
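
As a minimal sketch of that mechanic, a Dataframe SQL cell can reference the output of a previous SQL cell by its dataframe name (here, cached_results is a hypothetical name for a prior cell's output):

-- Dataframe SQL: query the in-memory result of a previous SQL cell by its
-- dataframe name, with no trip back to the warehouse.
SELECT *
FROM cached_results
LIMIT 10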

First, let’s make some queries to try this method out.

Step 1: Create an expensive query

We don't want to unnecessarily waste compute resources, so we'll just run a mildly expensive query. This query returns the average daily "Goldstein Scale" (a measure of geopolitical stability) observed in the media prior to 2021. This will be our "baseline" for an average day.

This query processes ~10GB of data and takes around 10s to return, which is too slow for an app. We'll cache it aggressively on a scheduled run. This one really never needs to be refreshed, since it's all historical data.

Here's the SQL query (this cell's output dataframe is named big_slow, which we'll reference later):

SELECT
  EXTRACT(DAY FROM _PARTITIONTIME) AS day_of_month,
  AVG(GoldsteinScale) AS goldstein,
  AVG(AvgTone) AS tone
FROM `gdelt-bq.gdeltv2.events_partitioned`
WHERE DATE(_PARTITIONTIME) < "2021-01-01"
GROUP BY 1

Step 2: Create a cheap query

Let's compare this month’s data for the same metric against that historical baseline to see how it stacks up (this cell's output dataframe is named short_fast):

SELECT
  EXTRACT(DAY FROM _PARTITIONTIME) AS day_of_month,
  AVG(GoldsteinScale) AS goldstein_scale
FROM `gdelt-bq.gdeltv2.events_partitioned`
WHERE DATE_TRUNC(DATE(_PARTITIONTIME), MONTH) = DATE_TRUNC(CURRENT_DATE('America/Los_Angeles'), MONTH)
GROUP BY 1

This is a very cheap query, only processing ~15MB of data! We aren't worried about it impacting our DB spend, so we won't bother caching it very aggressively. It still takes ~5 seconds to return (shout out to columnar databases for making it almost as fast to scan 10GB as 15MB), so we'll cache it on initial app load.

This means that every time a user loads the app, it will run this cheap query and update the data to the current day, but won't run the large, multi-GB query, which is cached to run only once a month.

Step 3: Use Dataframe SQL to merge the two results together

This Dataframe SQL query merges the two results together, letting the small query refresh on every app load so the output stays effectively real-time, while the large, expensive query stays cached.

SELECT
  s.day_of_month,
  s.goldstein_scale AS this_month,
  b.goldstein AS historical,
  FLOOR((s.goldstein_scale / b.goldstein) * 100) AS percent_of_historical
FROM short_fast s
LEFT JOIN big_slow b ON s.day_of_month = b.day_of_month
ORDER BY 1 ASC

Dataframe SQL is in-memory and very fast, so it incurs zero DB spend and returns in < 1 second.

Looks like so far, this month is off to a much more positive start than baseline. Nice!

What's next?

This is a very specific example of a feature that can be applied to many situations. Think about the apps you've built: are there any large queries that could be broken up and cached independently to create a faster end-user experience, or reduce database spend?

You can review the docs on caching here, and Dataframe SQL here.