SQL cells
Query your warehouse, uploaded files, or dataframes directly with SQL.
- Users will need Can Edit permissions to add or edit SQL cells.
Hex has first-class SQL support via SQL cells, each of which is a fully fledged query IDE complete with autocompletion, caching, a Data browser, and more. Projects can have an unlimited number of SQL cells interleaved with Text, Code, Chart, or Input cells, creating an extremely flexible querying environment.
SQL Cell Principles
Hex's cell-based structure, along with a couple of other Hex-specific features, makes for a more powerful but slightly different querying workflow than you may be used to. There are five key principles to understand about writing SQL in Hex:
- There are two kinds of SQL queries in Hex: Queries can run against a remote database or warehouse (data connection SQL), or against a local dataframe or .csv (dataframe SQL). Together, these two options let you execute some very powerful workflows.
- Data connection SQL queries can return results as dataframes or queries: By default, SQL queries return their entire result set as a dataframe that can be operated on in memory. When working with larger datasets, it's more efficient to put a SQL cell into "query mode", which instead returns a query object, a lightweight "pointer" to the complete result set.
- SQL queries can be chained together: Any SQL query can refer to previous SQL queries in a project to form "chains" of queries, much like a complex query might include many CTEs. You can use this chained SQL technique to split logic up atomically and make complex queries more readable.
- SQL queries can be parameterized with project logic: Values from other parts of project logic can be inserted into SQL queries using Jinja syntax. This lets you parameterize queries based on user input, the output of an earlier query, or Python code to build complex user-facing data applications powered by SQL (see the sketch after this list).
- SQL queries are reactive to updates and user input: Queries will reactively re-run whenever their inputs or dependencies change, making for fast iteration and immediate feedback loops. Project state will always be up to date, even across many SQL, Python, or other cells.
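As a minimal sketch of the parameterization principle above, a data connection SQL cell might reference a project variable with Jinja syntax. The table and variable names here (`orders`, `selected_region`) are illustrative, not part of Hex's API:

```sql
-- `selected_region` is assumed to be defined by an input cell or a Python cell
SELECT user_id, region, SUM(order_total) AS total_spend
FROM orders
WHERE region = {{ selected_region }}
GROUP BY 1, 2
```

Because queries are reactive, whenever `selected_region` changes this query re-runs with the new value.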
When should I use data connection SQL vs dataframe SQL?
This decision is often made for you by the location of data. If data is in a database or warehouse, you must use data connection SQL. If it's in a dataframe or a .csv, you must use dataframe SQL.
However, you can build clever and complex things by combining these two in more fluid ways, from "Chained SQL" workflows that replace subqueries to powerful cross-database joins. Some examples of when you might reach for dataframe SQL over data connection SQL:
- If you need to run an intermediary operation in Python. Say you run a Snowflake SQL query, and then need to geocode a column using Python. Dataframe SQL will let you write SQL against the output of that intermediary geocoding operation, even though that data is no longer in Snowflake.
- If you need to query data from a .csv file. Dataframe SQL can run queries directly against uploaded files, with just a `SELECT * FROM "file_name.csv"`.
- If you need to join data across different data sources. You can run two data connection SQL queries, one against a Snowflake connection and one against a Redshift DB, and then join the two dataframe outputs together using a dataframe SQL query (see the sketch after this list).
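A minimal sketch of that cross-source join, assuming the two upstream cells returned dataframes named `snowflake_orders` and `redshift_customers`:

```sql
-- Dataframe SQL (DuckDB) joining the outputs of two different connections
SELECT c.customer_id, c.customer_name, SUM(o.order_total) AS lifetime_value
FROM snowflake_orders o
JOIN redshift_customers c ON o.customer_id = c.customer_id
GROUP BY 1, 2
```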
The biggest reason not to use dataframe SQL, if you're in a situation where you could write data connection SQL instead: dataframe SQL queries must be written in DuckDB's PostgreSQL-like flavor of SQL, which doesn't support some database-specific functions, or might just feel unfamiliar to you. Dataframe SQL also requires that the dataset be loaded in memory, which rules out querying extremely large datasets this way.
In general, we hope that it Just Works, but this matrix may be helpful when trying to decide which mode to use:
Capability | Connection SQL | Dataframe SQL |
---|---|---|
Can talk to SQL databases | ✅ | ❌ |
Can use database-specific functions | ✅ | ❌ |
Can query Python dataframes and variables | ❌ | ✅ |
Returns results as a dataframe | ✅ | ✅ |
Can join data across sources | ❌ | ✅ |
Works with extremely large datasets* | ✅ | ❌ |
Speed on small to large datasets | 🏎️** | 🚀 |
Results can be cached | ✅ | ❌ |
* To be clear, we're talking upwards of tens of millions of rows / tens of GB here.
** We don't know how fast your DB is, but we hope it feels like a Formula 1 🙂.
When should I use query mode vs. dataframe mode?
Dataframe mode (the default) works great for most queries on small-to-medium sized datasets. Query mode should generally be used when working with large datasets.
Query mode is great for working with large amounts of data because it lets you leverage the compute engine of your data warehouse. You can use query mode to do end-to-end analytics in SQL with no code cells, or use it to aggregate large amounts of data that can then be brought into memory for Python, dataframe SQL, or writeback workflows.
Dataframe mode queries always return dataframes, which are compatible with all cells in Hex. Use dataframe mode:
- for any work with small-to-medium datasets
- if you want to use Hex's memory instead of your data warehouse's
- to do work in Python, dataframe SQL, or writeback cells
- to aggregate the results of a query mode object and bring it into memory
When should I chain SQL?
Hex supports a workflow called chained SQL, where a SQL query references another SQL query from the project in the `FROM` clause. This functionality is slightly different across data connection and dataframe SQL, but the end result is the same: you can break large SQL queries up into individual cells that are clean and easy to debug. There are many examples and details in the reference section, but the two main use cases of chained SQL are easy to understand: simplifying complex queries, and letting SQL users 'drill down' into data.
Chained SQL can be used to break complex queries into atomic units that are easy to read, understand, and debug. A query with three CTEs can become four chained SQL cells, each able to be run, edited, cached, and debugged separately.
Chained SQL also lets users explore and iterate without rewriting code. Run a query with an interesting output that you'd like to drill deeper into? Traditionally, you'd copy that entire query and change it to fit your new format. If you had to aggregate at different levels, you'd begin to write a CTE (see point #1). With Chained SQL, you simply add a new SQL cell and write a fresh query against the interesting output.
This is available for dataframe SQL and data connection SQL, but is evaluated somewhat differently — dataframe SQL is actually running against the materialized output of previous queries, while chained data connection queries dynamically compile to CTEs in a final query executed on the remote connection. In practice, this doesn't have too much of an impact on user experience outside of the fact that staying in data connection SQL means you can keep using warehouse-specific SQL syntax features. Read more in the reference section of this doc.
Query reactivity and caching
Hex continuously and reactively re-executes cells when their dependencies and inputs change. This makes intuitive sense for Python code in the context of a notebook, but can be an adjustment for SQL users who are used to having to click "Run". Here's what you really need to know:
- This keeps the state of the notebook up-to-date. You get fast feedback and cells are never out of sync with each other.
- Dataframe SQL queries are "free" and fast. These queries run in-memory, in Hex, and never hit your warehouse. You can run a multi-million row dataframe SQL query ten billion trillion times and it won't cost you a cent (it will, however, bankrupt Hex). They are extremely fast, generally on the order of hundreds of milliseconds, so there's no need to worry about them reactively executing.
- Data connection SQL queries can be cached. You can cache a query either "on initial app load", so it executes against the connection once in each session, or "on scheduled run", so it only executes against the connection on a schedule (say, twice a day, or once a week). This can save both money and time when working with large SQL queries. More on caching here.
- Only SQL queries that have had a dependency update will re-execute. If you run a massive `SELECT *` in the very first cell of your project, that query won't re-execute every time you change an input parameter halfway through. It's still a good idea to cache, but part of the magic is that Hex only ever runs the necessary cells.
- If you are working with huge, scary data, you can enable "Cell Only" run mode to prevent reactive re-execution of cells.
SQL Cell Reference
Data connection SQL
This is the SQL option you are probably most familiar with. Once you have configured a data connection to your warehouse or database, you can query it through a SQL cell. If your workspace has set up shared data connections, these connections will also be available to choose from.
First, add a new SQL cell and select a data source:
You can then write SQL directly in the cell, using the SQL dialect of the selected connection:
SQL cells within a project share the same database session, allowing for multi-statement queries and alter/update workflows.
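As an illustration of what a shared session enables (a sketch only; `orders` and `recent_orders` are hypothetical names, and this assumes your database supports session-scoped temporary tables), one cell could create a temporary table that a later cell then queries:

```sql
-- Cell 1: create a session-scoped temporary table (PostgreSQL syntax)
CREATE TEMPORARY TABLE recent_orders AS
SELECT *
FROM orders
WHERE created_at > now() - interval '7 days';
```

```sql
-- Cell 2: later in the project, the same session can still see the temporary table
SELECT COUNT(*) AS recent_order_count
FROM recent_orders;
```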
Running an uncached SQL cell will run that query against your database. To avoid accidentally running expensive queries every time you re-run a project, you can cache query results.
If you know you'll be running a lot of queries while developing, it can be smart to add a temporary LIMIT clause to potentially large queries.
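For example, during development you might keep a throwaway guard on a query against a potentially large table (`events` here is just an illustrative name):

```sql
SELECT *
FROM events
-- Temporary guard while iterating; remove before publishing the app
LIMIT 1000
```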
SQL cells have an integrated preview which shows up to 10 rows.
Query mode
Data connection SQL cells can return results in two different ways: dataframe mode or query mode. Dataframe mode, the default, is great for most use cases. Query mode is powerful when working with large result sets.
In dataframe mode (the default), SQL cells return their entire result set in-memory as a pandas dataframe. If you run `SELECT names FROM fruit_table` in dataframe mode, there will now be an object taking up space in the project's memory that contains a few hundred fruits (banana, apple, pear, etc.). You can operate on that dataframe anywhere in Hex: reshape it in a Python cell, query it using dataframe SQL, visualize it in a Chart cell, and so on.
In query mode, a SQL cell returns a "query" object that serves as a pointer to the complete result set. The query results reside in your data warehouse, allowing you to work with large datasets while leveraging the compute engine of your data warehouse. You can continue your analysis without bringing data into Hex's memory by using and returning query objects in data connection SQL cells, chart cells, table display cells, pivot cells, and filter cells.
Referencing a query object in a Python code cell, a SQL cell using dataframe SQL, or a writeback cell will pull all of the data into memory. When working with very large datasets, be mindful when using these cells on a query object to avoid an out-of-memory error.
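For example (cell and column names are hypothetical), if a query mode cell named `big_events` selects from a very large table, a second data connection SQL cell can aggregate it without pulling the raw rows into Hex's memory:

```sql
-- Chained data connection SQL referencing the query object `big_events`;
-- this compiles into the final query and runs in the warehouse.
SELECT user_id, COUNT(*) AS event_count
FROM big_events
GROUP BY 1
```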
Dataframe SQL
Hex also lets you write dataframe SQL against any dataframe in the project, including the results of previous SQL Cells.
Dataframe SQL queries execute right in Hex using DuckDB, so they don't incur any usage costs on your data connections. The SQL syntax is very similar to PostgreSQL, and the full reference can be found in the DuckDB docs.
Wondering whether to use dataframe SQL or not? Check out the SQL Principles section of this doc.
To use dataframe SQL, add a SQL cell and select Dataframes as the data source.
You can then reference any dataframe in the `FROM` or `JOIN` clauses, using the dataframe name where you would normally use the SQL table name.
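A minimal sketch, assuming the project already contains a dataframe named `df_users` with a `country` column:

```sql
-- Dataframe SQL: the dataframe name stands in where a table name would go
SELECT country, COUNT(*) AS user_count
FROM df_users
GROUP BY 1
ORDER BY user_count DESC
```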
Dataframe indexes are not returned or accessible in a dataframe SQL query. This also applies to multi-index dataframes, and although queries will not error, you can lose important nested context if you aren't aware of this. If you need to preserve an index, we recommend calling `reset_index()` on the dataframe before querying it with SQL.
You can join across dataframes, but you cannot directly join across dataframes and database tables. If you want to join a dataframe and a db table, you'll need to first run a data connection SQL query to get that table as a dataframe in your project. Once everything is a dataframe, you can join the two together.
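A rough sketch of that workflow, assuming an earlier data connection SQL cell returned a dataframe named `warehouse_orders` and an uploaded file produced a dataframe named `regions`:

```sql
-- Dataframe SQL join between a query result and another dataframe
SELECT r.region_name, SUM(o.order_total) AS total_sales
FROM warehouse_orders o
JOIN regions r ON o.region_id = r.region_id
GROUP BY 1
```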
Caching is not available for dataframe SQL. This is because dataframe SQL doesn't hit your database, so caching wouldn't protect you from any accidental spend.
Also, it's because dataframe SQL runs in-memory and tends to be very quick :)
Chained SQL
Chained SQL is a common workflow in Hex that allows a SQL query to reference other SQL queries from the project in its FROM clause. This functionality is slightly different across data connection and dataframe SQL, but the end result is the same: you can break large SQL queries up into individual cells that are clean, efficient, and easy to debug.
For advice on when to use chained SQL, see the SQL Principles section of this doc.
Chaining dataframe SQL
Dataframe SQL queries are often chained, and it's seamless to do so. Every SQL query can return a dataframe as its output, and because dataframe SQL (duh) can query dataframes, you can query the results of any previous query. Say you have a .csv file, `users.csv`. You write a dataframe SQL query against it to check it out, and name this query `users`:

```sql
SELECT user_id, created_at, last_seen_at, monthly_payment
FROM "users.csv"
WHERE is_customer IS TRUE
```
When you go to write another, more in-depth query, you don't have to write it against `users.csv` anymore. You can write a chained dataframe SQL query directly against the `users` dataframe:

```sql
SELECT DATE_TRUNC('month', last_seen_at) AS month, COUNT(*)
FROM users
GROUP BY 1
```
This is a simple example, but you can see how, in a complex project, this could become useful. Especially if you dabble in Python; imagine you have Python code that generates cohorts from that `users` query:
```python
cohorted_users = generate_cohorts(users)
```
Rather than be stuck in Python, you can jump right back into SQL using a dataframe SQL cell:
```sql
SELECT cohort_id, SUM(monthly_payment) AS cohort_value
FROM cohorted_users
GROUP BY 1
```
Chaining data connection SQL
Chained SQL works great with Query mode, but SQL cells do not need to be in query mode to be chained using data connection SQL. Even if a SQL cell is returning results as a dataframe, it can be chained in a data connection SQL query.
In practice, chained SQL queries work almost exactly the same for data connection SQL and dataframe SQL. There are just three important differences.
- When chaining data connection SQL queries, all queries must be running against the same data connection. You can chain three queries against a Snowflake warehouse. You cannot chain two Snowflake queries with one Redshift query (to do that, you'd use dataframe SQL).
- Chained SQL for data connections uses the SQL syntax of the connection. For many use cases, this is the primary differentiator from dataframe SQL queries, which have to be written in DuckDB's PostgreSQL-like flavor of SQL.
- When chained, data connection SQL writes a query with CTEs. Imagine a Hex project with two SQL cells, both querying a Snowflake warehouse. The first SQL cell, called `customer_order_facts`, is defined as:

```sql
SELECT customer_id, MAX(order_total) AS most_expensive_order, SUM(order_total) AS lifetime_order_total, COUNT(*) AS count_orders
FROM orders
GROUP BY 1
```

Later in the project, another SQL cell called `order_total_distribution` looks like this:

```sql
SELECT lifetime_order_total, COUNT(*)
FROM customer_order_facts
GROUP BY 1
```
`order_total_distribution` is now a chained SQL query. It is written against a Snowflake data connection, but it is not referencing a table in the Snowflake warehouse; it's referencing another query from the same Hex project. When run, Hex will compile this query into:
```sql
WITH customer_order_facts AS (
    SELECT customer_id, MAX(order_total) AS most_expensive_order, SUM(order_total) AS lifetime_order_total, COUNT(*) AS count_orders
    FROM orders
    GROUP BY 1
)
SELECT lifetime_order_total, COUNT(*)
FROM customer_order_facts
GROUP BY 1
```
You can see these compiled queries by clicking "View compiled" in a SQL cell.
Troubleshooting Chained SQL
At times, Hex may not be able to compile CTE references correctly. To check if this is happening, view the compiled SQL via the view compiled icon.
Often, this is the result of a CTE sharing a name with a reserved SQL keyword, which results in the SQL parser being unable to swap the CTE name with a query.
To work around this, replace any CTE names that use reserved keywords.
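For example (cell names are illustrative), a chained query referencing an upstream cell named `order` can fail to compile because ORDER is a reserved keyword; renaming the upstream cell avoids the collision:

```sql
-- Problematic: the upstream SQL cell is named `order`, a reserved keyword
-- SELECT * FROM order

-- Works: rename the upstream cell to something like `order_summary`
SELECT *
FROM order_summary
```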
Query datatypes
When query results are returned in-memory, they are stored as pandas DataFrames, which requires that some database-specific column types be transformed into pandas-supported datatypes. We try to keep the conversion from database column types to pandas types as consistent as possible across databases.
Generally, the following mappings are applied:
database column type | pandas column type |
---|---|
int | int |
int (with null values) | float |
float | float |
decimal | float¹ |
string | object |
date² | date |
timestamp | timestamp |
¹ By default, decimals will be cast to floats. This behavior can be turned off; see the Infer numeric datatypes section below for more information.
² How date columns are represented depends on the Use native Python dates setting, described below.
Some notable differences between database column types and pandas data types:
- Any `int` type column which contains nulls will be forced to a `float` type in the output dataframe. This is due to how pandas handles integer columns.
- Depending on the database, decimal-type columns are returned with a varying default number of decimal places (PostgreSQL: 18, BigQuery: 9, Snowflake: 1).
- Datetimes are converted to pandas integers at nanosecond precision. The number of nanoseconds required to represent far-future dates exceeds the maximum allowable number for an `integer` type. If we detect such a large datetime, we convert all datetime columns in the dataframe to an `object` type to avoid causing a conversion error.
Infer numeric datatypes
When enabled (the default), any column with a decimal type will be cast to a float. This is because Pandas does not support decimal datatypes. In general, casting a decimal to a float makes the number easier to work with downstream, and easier to format in our Table display cells.
However, there are times when casting a high-precision decimal to a float will result in loss of precision. In this case, you can disable the Infer numeric datatypes option in the SQL cell configuration options. When this setting is turned off, decimals will instead be represented as an object type.
Use native Python dates
When enabled (the default), date columns in a SQL result will be represented as a `Date` type, with no time or timezone component, e.g. `2022-09-12`. This allows downstream cells, such as Pivot and Chart cells, to handle these values in the expected way. This differs from the standard pandas dataframe behavior, which represents date columns as `Timestamps`; there may be a small performance tradeoff when this conversion occurs.
When disabled, date columns in a SQL result will be represented as a `Timestamp` type, with both a time and a timezone component (midnight and UTC, respectively), e.g. `2022-09-12 00:00:00+00:00`. This is in line with the standard pandas dataframe behavior. Since these columns are timestamps, downstream Pivot and Chart cells will perform timezone conversion on these columns based on the `hex_timezone` value. This often results in data appearing as the "wrong" day, since the timezone conversion may result in a timestamp on a different calendar day in the target timezone (for example, midnight in UTC on a Tuesday is still Monday in North and South America).
Disabling this setting can result in unexpected results in downstream Chart and Pivot cells. Only disable this setting when absolutely necessary.
Query metadata
When executing SQL cells, we inject a comment into the query that includes metadata on the source of that query: namely, the issuing user's email, a link to the project and cell where the query originated, and the relevant trace ID. This allows database administrators to track down and audit queries originating from Hex. To view query metadata, click on the curly brace `{ }` icon in the menu at the top of the cell.
Multi-statement queries
You can write multiple statements in a single SQL cell, separating statements with a semicolon (`;`).
Note that Athena, Databricks, ClickHouse, & Trino do not support multi-statement queries.
Below is a PostgreSQL example that:
- Drops a table if it already exists
- Creates a new table, defining the column names and types.
- Inserts a row of data into the table
- Returns the contents of the table.
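A minimal sketch of such a cell (table and column names are illustrative):

```sql
DROP TABLE IF EXISTS fruit_inventory;

CREATE TABLE fruit_inventory (
    fruit_name varchar(50),
    quantity   integer
);

INSERT INTO fruit_inventory (fruit_name, quantity)
VALUES ('banana', 42);

SELECT * FROM fruit_inventory;
```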
All databases, except for SQL Server, will return the results of the last statement. SQL Server returns the results of the first statement.
Some databases have limitations on how statements that write back to the database are supported.
- Athena: See their docs to learn how to use write-back statements.
Killing SQL queries
To kill a query, click on the loading animation to the upper right of the running cell. This will send a request to the database to stop the query.
Clicking Stop/Interrupt from the dropdown of a project's run options will also kill any currently-running queries. If you hover over the loading animation of a SQL cell and the tooltip reads "This cell has been queued for execution", the cell is still queued to run and the query has not yet been sent to the database; stopping or killing the project's kernel while the cell is still queued will prevent the query from ever being sent to the database.
Requests to kill a SQL query are not always respected by the database at the time of request. Common causes include the query having already completed by the time the request to cancel has reached the database, a warehouse's compute resources being exhausted, or a task scheduler being overloaded.
The only way to ensure that a query initialized from Hex isn't run against your database is to cancel it while still queued, before it has been sent.