Hex has first-class SQL support via SQL cells, each of which is a fully fledged query IDE complete with autocompletion, caching, a Data browser, and more. Projects can have an unlimited number of SQL cells interleaved with Text, Code, Chart, or Input cells, creating an extremely flexible querying environment.
SQL Cell Principles
Hex's cell-based structure, along with a couple other Hex-specific features, makes for a more powerful but slightly different querying workflow than you may be used to. There are five key principles to understand about writing SQL in Hex:
- There are two kinds of SQL queries in Hex: Queries can run against a remote database or warehouse (data connection SQL), or against a local dataframe or .csv (dataframe SQL). Together, these two options let you execute some very powerful workflows.
- Data connection SQL queries can return results as dataframes or queries: By default, SQL queries return their entire result set as a dataframe that can be operated on in memory. When working with larger datasets, it's more efficient to put a SQL cell into "query mode", which instead returns a query object— a lightweight "pointer" to the complete result set, with a 100 row preview.
- SQL queries can be chained together: Any SQL query can refer to previous SQL queries in a project to form "chains" of queries, much like a complex query might include many CTEs. You can use this chained SQL technique to split logic up atomically and make complex queries more readable.
- SQL queries can be parameterized with project logic: Values from other parts of project logic can be inserted into SQL queries using Jinja syntax. This lets you parameterize queries based on user input, the output of an earlier query, or Python code to build complex user-facing data applications powered by SQL.
- SQL queries are reactive to updates and user input: Queries will reactively re-run whenever their inputs or dependencies change, making for fast iteration and immediate feedback loops. Project state will always be up to date, even across many SQL, Python, or other cells.
When should I use data connection SQL vs dataframe SQL?
This decision is often made for you by the location of data. If data is in a database or warehouse, you must use data connection SQL. If it's in a dataframe or a .csv, you must use dataframe SQL.
However, you can build clever and complex things by combining these two in more fluid ways, from "Chained SQL" workflows that replace subqueries to powerful cross-database joins. Some examples of when you might reach for dataframe SQL over data connection SQL:
- If you need to run an intermediary operation in Python. Say you run a Snowflake SQL query, and then need to geocode a column using Python. Dataframe SQL will let you write SQL against the output of that intermediary geocoding operation, even though that data is no longer in Snowflake.
- If you need to query data from a .csv file. Dataframe SQL can directly run queries against uploaded files, with just a
SELECT * FROM "file_name.csv".
- If you need to join data across different data sources. You can run two data connection SQL queries, one against a Snowflake connection and one against a Redshift DB, and then join the two dataframe outputs together using a Dataframe SQL query.
The biggest reason not to use dataframe SQL, if you're in a situation where you can write data connection SQL instead? Dataframe SQL queries must be written in DuckDB's PostgreSQL-like flavor of SQL, which doesn't support some database-specific functions— or just might feel unfamiliar to you. Also, dataframe SQL requires the dataset be loaded in-memory, which precludes extremely large datasets from being queried this way.
In general, we hope that it Just Works, but this matrix may be helpful when trying to decide which mode to use:
|Target Use||Connection SQL||Dataframe SQL|
|Can talk to SQL databases||✅||❌|
|Can use database-specific functions||✅||❌|
|Can query Python dataframes and variables||❌||✅|
|Returns results as a dataframe||✅||✅|
|Can join data across sources||❌||✅|
|Works with extremely large datasets*||✅||❌|
|Speed on small — large datasets||🏎️**||🚀|
|Results can be cached||✅||❌|
* To be clear, we're talking of upwards of tens of millions of rows / tens of GB here.
** We don't know how fast your DB is, but we hope it feels like a Formula 1 🙂.
When should I use query mode vs. dataframe mode?
Dataframe mode (the default) should generally work great for most queries on small-medium size datasets. Query mode should generally only be used when working with very large datasets.
Query mode queries are generally used as intermediary steps in an aggregation pipeline. You can think of this workflow like a funnel: you are starting with an extremely large dataset, and progressively making it smaller and smaller as you understand more about it (via subsequent chained SQL queries), until it's small enough to be returned in-memory with dataframe mode and visualized.
Dataframe SQL queries always return dataframes.
When should I chain SQL?
Hex supports a workflow called chained SQL, where a SQL query references another SQL query from the project in the
FROM clause. This functionality is slightly different across data connection and dataframe SQL, but the end result is the same: you can break large SQL queries up into individual cells that are clean and easy to debug. There are many examples and details in the reference section, but the two main use cases of chained SQL are easy to understand: simplifying complex queries, and letting SQL users 'drill down' into data.
Chained SQL can be used to break complex queries into atomic units that are easy to read, understand, and debug. A query with three CTEs can become four chained SQL cells, each able to be run, edited, cached, and debugged separately.
Chained SQL also lets users explore and iterate without rewriting code. Run a query with an interesting output that you'd like to drill deeper into? Traditionally, you'd copy that entire query and change it to fit your new format. If you had to aggregate at different levels, you'd begin to write a CTE (see point #1). With Chained SQL, you simply add a new SQL cell and write a fresh query against the interesting output.
This is available for dataframe SQL and data connection SQL, but is evaluated somewhat differently— dataframe SQL is actually running against the materialized output of previous queries, while chained data connection queries dynamically compile to CTEs in a final query executed on the remote connection. In practice, this doesn't have too much of an impact on user experience outside of the fact that staying in data connection SQL means you can keep using warehouse-specific SQL syntax features. Read more in the reference section of this doc.
Query reactivity and caching
Hex continuously and reactively re-executes cells when their dependencies and inputs change. This makes intuitive sense for Python code in the context of a notebook, but can be an adjustment for SQL users that are used to having to click "Run". Here's what you really need to know:
- This keeps the state of the notebook up-to-date. You get fast feedback and cells are never out of sync with each other.
- Dataframe SQL queries are "free" and fast. These queries run in-memory, in Hex, and never hit your warehouse. You can run a multi-million row dataframe SQL query ten billion trillion times and it won't cost you a cent (it will, however, bankrupt Hex). They are extremely fast, generally on the order of hundreds of milliseconds, so there's no need to worry about them reactively executing.
- Data connection SQL queries can be cached. You can cache a query either "on initial app load", so it will execute against the connection once in each session, or "on scheduled run", so that it will only execute against the connection twice a day, or once a week. This can save both money and time when working with large SQL queries. More on caching here.
- Only SQL queries that have had a dependency update will re-execute. If you run a massive
SELECT *in the very first cell of your project, that query won't re-execute every time you change the input parameter halfway through. It's still a good idea to cache, but part of the magic is that Hex only ever runs necessary cells.
- If you are working with huge, scary data, you can enable "Cell Only" run mode to prevent reactive re-execution of cells, or enable query mode for safety.
SQL Cell Reference
Data connection SQL
This is the SQL option you are probably most familiar with. Once you have configured a data connection to your warehouse or database, you can query it though a SQL cell. If your workspace has set up shared data connections, these connections will also be available to choose from.
First, add a new SQL cell and select a data source:
You can then write SQL directly in the cell, using the SQL dialect of the selected connection:
SQL cells within a project share the same database session, allowing for multi-statement queries and alter/update workflows.
Running an uncached SQL cell will run that query against your database. To avoid accidentally running expensive queries every time you re-run a project, you can cache query results.
If you know you'll be running a lot of queries while developing, it can be smart to add a temporary LIMIT clause to potentially large queries.
SQL cells have an integrated preview which shows up to 10 rows.
Data connection SQL cells can return results in two different ways: dataframe mode or query mode. Dataframe mode, the default, is great for most use cases. Query mode should generally be enabled only when working with extremely large result sets.
In dataframe mode (the default), SQL cells return their entire result set in-memory as a pandas dataframe. If you run "SELECT names FROM fruit_table" in dataframe mode, there will now be an object taking up space in the project's memory that contains a few hundred fruits
banana,apple,pear,etc. You can operate on that dataframe anywhere in Hex— reshape it in a Python cell, query it using dataframe SQL, visualize it in a Chart cell, etc.
In query mode, SQL cells return only a 100 row preview of the result set alongside a "query" object that serves as a pointer to the complete result set. The complete result set does not leave the warehouse or database, which makes this mode far superior to dataframe mode when working with large datasets. Waiting for a 100 million row query result to stream into Hex in dataframe mode could take minutes, or crash the notebook; If you're going to continue operating on and aggregating the data, you may as well leave it in the warehouse and just return a query that you can reference in subsequent chained SQL queries.
Hex also lets you write dataframe SQL against any dataframe in the project, including the results of previous SQL Cells.
Dataframe SQL queries execute right in Hex using DuckDB, so they don't incur any usage costs on your data connections. The SQL syntax is very similar to PostgreSQL, and the full reference can be found in the DuckDB docs.
Wondering whether to use dataframe SQL or not? Check out the SQL Principles section of this doc.
To use dataframe SQL, add a SQL cell and select Dataframes as the data source.
You can then reference any dataframe in the FROM or JOIN clauses, using the dataframe name where you would normally use the SQL table name.
Dataframe indexes are not returned or accessible in a dataframe SQL query. This also applies to multi-index dataframes, and although queries will not error, you can lose important nested context if you aren't aware of this. If you need to preserve an index, we recommend calling
reset_index() on the dataframe before querying it with SQL.
You can join across dataframes, but you cannot directly join across dataframes and database tables. If you want to join a dataframe and a db table, you'll need to first run a data connection SQL query to get that table as a dataframe in your project. Once everything is a dataframe, you can join the two together.
Caching is not available for dataframe SQL. This is because dataframe SQL doesn't hit your database, so caching wouldn't protect you from any accidental spend.
Also, it's because dataframe SQL runs in-memory and tends to be very quick :)
Chained SQL is a common workflow in Hex that allows a SQL query to reference other SQL queries from the project in its FROM clause. This functionality is slightly different across data connection and dataframe SQL, but the end result is the same: you can break large SQL queries up into individual cells that are clean, efficient, and easy to debug.
For advice on when to use chained SQL, see the SQL Principles section of this doc.
Chaining dataframe SQL
Dataframe SQL queries are often chained, and it's seamless to do so. Every SQL query can return a dataframe as its output, and because dataframe SQL (duh) can query dataframes, you can query the results of any previous query. Say you have a .csv file, "users.csv". You write a dataframe SQL query against it to check it out, and name this query
SELECT user_id,created_at,last_seen_at, monthly_payment from "users.csv";
WHERE is_customer is TRUE
When you go to write another, more in depth query, you don't have to write it against "users.csv" anymore. You can write a chained dataframe SQL query directly against the
SELECT DATE_TRUNC(month,'last_seen_at') AS month, COUNT(*)
GROUP BY 1
This is a simple example, but you can see how, in a complex project, this could get useful. Especially if you dabble into Python; imagine you have Python code that generates cohorts from that users query:
cohorted_users = generate_cohorts(users)
Rather than be stuck in Python, you can jump right back into SQL using a dataframe SQL cell:
SELECT cohort_id, SUM(monthly_payment) AS cohort_value
GROUP BY 1
Chaining data connection SQL
Chained SQL works great with Query mode, but SQL cells do not need to be in query mode to be chained using data connection SQL. Even if a SQL cell is returning results as a dataframe, it can be chained in a data connection SQL query.
In practice, chained SQL queries work almost exactly the same for data connection SQL and dataframe SQL. There are just three important differences.
When chaining data connection SQL queries, all queries must be running against the same data connection. You can chain three queries against a Snowflake warehouse. You cannot chain two Snowflake queries with one Redshift query (to do that, you'd use dataframe SQL).
Chained SQL for data connections uses the SQL syntax of the connection. For many use cases, this is the primary differentiator from dataframe SQL queries, which have to be written in DuckDB's PostgreSQL-like flavor of SQL.
When chained, data connection SQL writes a query with CTEs. Imagine a Hex project with two SQL cells, both querying a Snowflake warehouse. The first SQL cell, called
customer_order_facts, is defined as
SELECT customer_id, MAX(order_total) AS most_expensive_order, SUM(order_total) AS lifetime_order_total, COUNT(*) AS count_orders
Later in the project, another SQL cell called
order_total_distribution looks like this:
SELECT lifetime_order_total, COUNT(*)
order_total_distribution is now a Chained SQL query. It is written against a Snowflake data connection, but it is not referencing a table in the Snowflake warehouse; It's referencing another query from the same Hex project. When run, Hex will compile this query into:
WITH customer_order_facts AS (
SELECT customer_id, MAX(order_total) AS most_expensive_order, SUM(order_total) AS lifetime_order_total, COUNT(*) AS count_orders
SELECT lifetime_order_total, COUNT(*)
You can see these compiled queries by clicking "View compiled" in a SQL cell
When query results are returned in-memory, they are stored as Pandas DataFrames, which requires that some database-unique column types be transformed into pandas supported datatypes. We try to keep the conversion between database column types to pandas type as consistent as possible between database types.
Generally, the following mappings are applied:
|database column type||pandas column type|
|int (with null values)||float|
¹By default, decimals will be cast to floats. This behavior can be turned off — see the below section for more information.
Some notable differences between database column types and pandas data types:
inttype column which contains nulls will be forced to a
floattype in the output dataframe. This is due to how pandas handles integer columns.
- Depending on the database, decimal-type columns are returned with varying default number of decimals places (PostgreSQL: 18, BigQuery: 9, Snowflake: 1)
- Datetimes are converted to Pandas integers at nanosecond precision. The number of nanoseconds required to represent far future dates exceeds the maximum allowable number for an
integertype. If we detect such a large datetime, we convert all datetime columns in a dataframe to an
objecttype to avoid causing a conversion error.
Infer numeric datatypes
When enabled, any column with a decimal type will be cast to a float. This is because Pandas does not support decimal datatypes. In general, casting a decimal to a float makes the number easier to work with downstream, and easier to format in our Table display cells.
However, there are times when casting a high-precision decimal to a float will result in loss of precision. In this case, you can disable the Infer numeric datatypes option in the SQL cell configuration options. When this setting is turned off, decimals will instead be represented as an object type.
Use native Python dates
When enabled, date columns in a SQL result will be represented as a
Date type, with no time, or timezone component, e.g.
2022-09-12. This allows downstream cells, such as Pivot and Chart cells, to handle these values in the expected way. This differs from the standard Pandas dataframe behavior, which represents date columns as
Timestamps — there may be a small performance tradeoff when this conversion occurs.
When disabled, date columns in a SQL result will be represented as a
Timestamp type, with both a time, and timezone component (midnight and UTC respectively), e.g.
2022-09-12 00:00:00+00:00. This is in line with the standard Pandas dataframe behavior. Since these columns are timestamps, downstream Pivot and Chart cells will perform timezone conversion on these columns based on the
hex_timezone value. This often results in data appearing as the "wrong" day, since the timezone conversion may result in a timestamp on a different calendar day in the target timezone (for example, midnight in UTC on a Tuesday is still Monday in North and South America).
Disabling this setting can result in unexpected results in downstream Chart and Pivot cells. Only disable this setting when absolutely necessary.
When executing SQL cells, we inject a comment into the queries including metadata on the source of that query. Namely the issuing user's email, a link to the project and cell where the query originated, and the relevant trace ID. This allows database administrators to track down and audit queries originating from Hex. To view query metadata, click on the view compiled icon in the top right corner of the cell.
You can write multiple statements in a single SQL cell, separating statements with a semicolon,
Note that Athena, Databricks, & Trino do not support multi-statement queries.
Below is a PostgreSQL example that:
- Drops a table if it already exists
- Creates a new table, defining the column names and types.
- Inserts a row of data into the table
- Returns the contents of the table.
All databases, except for SQL Server, will return the results of the last statement. SQL Server returns the results of the first statement.
Some database have limitations for how statements which write back to the database are supported.
- Athena: See their docs to learn about how to use write back statements.
Can't remember available table names, column names, or column data types?
You can browse your database and dataframe schemas using the Data browser, formally known as the schema browser, in the left sidebar with your data connections. Available dataframes can also be inspected in the Data browser.
From there, you can explore a connection's databases, schemas, tables, and columns by selecting the > button and expanding the contents of your database. Click on a table name to see the table's column names and types in the pop up at the bottom of the sidebar.
Metadata from dbt projects will also be displayed if you have a dbt integration configured. If the selected table is part of a model or defined as a source in dbt, you can see the model's most recent job execution date, source freshness, and the status of any tests configured on the model.
You can also click the Browse button on any SQL cell to jump to the current data connection in the Data browser.
The initial load time for full schema details can take a few seconds. We cache schema results so that future references to the database details are immediately available.
Data browser search
Search the contents of your data connection in the search bar at the top of the Data browser. View databases, schemas, and tables that match your search in the Objects tab, and columns that match your search in the Columns tab. You can filter your search results to one type of database object by preceding your search with
Hover over a table in a SQL cell query and select Search for table in data browser to launch a search.
Pin databases and schemas
Filter which databases and schemas show up in the Data browser using the pinning feature. To pin a database or schema, hover over the database object in the Data browser and select the pin icon to the right. You can also right click on the object and select Pin this database or Pin this schema. To display only pinned objects in your search result, select the Pinned button below the search bar at the top of the Data browser.
Pinned objects are preserved across projects for each user and data connection. If you pin a few tables and a schema in one project, these objects will be pinned when using the same data connection in a different project, but only for your Hex user.
Data browser shortcuts
Click the three-dot menu next to your selected table in order to see the available shortcuts.
Query table: This opens a new SQL cell with a query selecting all columns from the table, limited to 100 rows.
Copy qualified table name: This copies the fully qualified table name to your clipboard.
Copy all columns: This copies a SQL query that selects all columns from the table to your clipboard. This shortcut can be useful if you need to select all but a few columns.
Search within table: This launches a Data browser search filtered to the table.
Refreshing the cache
The contents of the Data browser are generated upon creation of a data connection, and will be cached until the Data browser is manually refreshed. To refresh, click the refresh icon next to the data connection name at the top of the Data browser.