There are two ways to write SQL in Hex: against a data connection or against a dataframe. Together, these two options let you execute some very powerful workflows.
This is the SQL option you are probably most familiar with. Once you have configured a data connection, you can query it through a SQL cell. If your workspace has shared data connections set up, those connections will also be available to choose from.
First, add a new SQL cell and select a data source:
You can then write SQL directly in the cell, using the SQL dialect of the selected connection:
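For example, a simple query against a hypothetical `orders` table (the table and column names here are illustrative, not part of any real schema) might look like:

```sql
-- Hypothetical example; adjust to your connection's dialect and schema
SELECT
    order_id,
    customer_id,
    order_total
FROM analytics.orders
WHERE order_date >= '2023-01-01'
LIMIT 100  -- a temporary LIMIT keeps queries cheap while developing
```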
Running an uncached SQL cell will run that query against your database. To avoid accidentally running expensive queries every time you re-run a project, you can cache query results.
If you know you'll be running a lot of queries while developing, it can be smart to add a temporary LIMIT clause to potentially large queries.
SQL cells have an integrated preview which shows up to 10 rows.
Query results are returned as dataframes. To rename the dataframe, click the grey Returned as dataframe title and edit it. The dataframe is renamed immediately; the SQL cell does not need to be re-executed for the change to take effect.
We return query results as pandas DataFrames, which requires transforming some database-specific column types into pandas-supported datatypes. We try to keep the mapping from database column types to pandas types as consistent as possible across databases.
Generally, the following mappings are applied:
| database column type | pandas column type |
| --- | --- |
| int (with null values) | float |
Some notable differences between database column types and pandas data types:
- An `int`-type column which contains nulls will be forced to a `float` type in the output dataframe. This is due to how pandas handles integer columns.
- Depending on the database, decimal-type columns are returned with varying default numbers of decimal places (PostgreSQL: 18, BigQuery: 9, Snowflake: 1)
- Datetimes are converted to pandas integers at nanosecond precision. The number of nanoseconds required to represent far-future dates exceeds the maximum allowable value for an `integer` type. If we detect such a large datetime, we convert all datetime columns in the dataframe to an `object` type to avoid causing a conversion error.
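The int-with-nulls coercion described above is easy to reproduce directly in pandas. This small sketch is not Hex's internal conversion code, just a demonstration of the underlying pandas behavior:

```python
import pandas as pd

# An integer column with no nulls keeps an integer dtype
no_nulls = pd.Series([1, 2, 3])
print(no_nulls.dtype)  # int64

# Introducing a null (None becomes NaN) forces the column to float,
# because NaN is not representable in a plain numpy integer dtype
with_nulls = pd.Series([1, 2, None])
print(with_nulls.dtype)  # float64
```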
When executing SQL cells, we inject a comment into the query containing metadata on its source: the issuing user's email, a link to the project and cell where the query originated, and the relevant trace ID. This allows database administrators to track down and audit queries originating from Hex. To view query metadata, click on the view compiled icon in the top right corner of the cell.
You can write multiple statements in a single SQL cell, separating statements with a semicolon.
Note that Athena, Transform, & Trino do not support multi-statement queries.
Below is a PostgreSQL example that:
- Drops a table if it already exists
- Creates a new table, defining the column names and types
- Inserts a row of data into the table
- Returns the contents of the table
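A minimal version of those steps (the table name `multi_statement_demo` is illustrative) could look like:

```sql
DROP TABLE IF EXISTS multi_statement_demo;

CREATE TABLE multi_statement_demo (
    id   integer,
    name text
);

INSERT INTO multi_statement_demo (id, name) VALUES (1, 'first row');

SELECT * FROM multi_statement_demo;
```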
All databases, except for SQL Server, will return the results of the last statement. SQL Server returns the results of the first statement.
Some databases have limitations on how statements which write back to the database are supported.
- Athena: See their docs to learn about how to use write back statements.
- Transform: Write-back statements will return an error; however, the statement has actually been executed.
Hex also lets you write SQL against any dataframe in the project, including the results of previous SQL cells. Dataframe SQL queries execute right in Hex using DuckDB, so they don't incur any usage costs on your data connections. The SQL syntax is very similar to PostgreSQL, and the full reference can be found in the DuckDB docs.
Dataframe SQL executes in-memory, which makes it very fast for most queries, but means it can't match a cloud data warehouse on very large queries. If your query processes a lot of data and seems to be running slowly, consider whether it really needs to be Dataframe SQL.
To use Dataframe SQL, add a SQL cell and select Dataframes as the data source.
You can then reference any dataframe in the FROM or JOIN clauses, using the dataframe name where you would normally use the SQL table name.
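For instance, if a previous cell produced a dataframe named `orders_df` (a hypothetical name), you could aggregate it with:

```sql
SELECT
    customer_id,
    COUNT(*) AS order_count
FROM orders_df  -- the dataframe name stands in for a table name
GROUP BY customer_id
ORDER BY order_count DESC
```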
Dataframe indexes are not returned or accessible in a Dataframe SQL query. This also applies to multi-index dataframes; queries will not error, but you can lose important nested context if you aren't aware of this. If you need to preserve an index, we recommend calling `reset_index()` on the dataframe before querying it with SQL.
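As a sketch of that workaround (the dataframe and column names here are hypothetical), resetting the index promotes it to an ordinary column that Dataframe SQL can see:

```python
import pandas as pd

# A dataframe whose meaningful key lives in the index
df = pd.DataFrame(
    {"revenue": [100, 250]},
    index=pd.Index(["us", "eu"], name="region"),
)

# The "region" index would be invisible to Dataframe SQL;
# reset_index() turns it into a regular column first
df_flat = df.reset_index()
print(df_flat.columns.tolist())  # ['region', 'revenue']
```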
You can join across dataframes, but you cannot directly join across dataframes and database tables. If you want to join a dataframe and a db table, you'll need to first run a data connection SQL query to get that table as a dataframe in your project. Once everything is a dataframe, you can join the two together.
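For example, if `users_df` came from a data connection SQL cell and `events_df` was built in Python (both names hypothetical), a Dataframe SQL cell can join them directly:

```sql
SELECT
    u.user_id,
    u.email,
    e.event_name
FROM users_df AS u
JOIN events_df AS e
    ON u.user_id = e.user_id
```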
Caching is not available for Dataframe SQL. This is because Dataframe SQL doesn't hit your database, so caching wouldn't protect you from any accidental spend.
Also, it's because Dataframe SQL runs in-memory and tends to be very quick :)
This decision is often made for you by the location of data. If data is in a database or warehouse, use Data connection SQL. If it's in a dataframe, then use Dataframe SQL.
However, you can build clever and complex things by combining these two in more fluid ways, from "Chained SQL" workflows that help with subqueries to cross-database joins. In general, we hope that it Just Works, but this matrix may be helpful when trying to decide which feature to use:
| Target Use | Connection SQL | Dataframe SQL |
| --- | --- | --- |
| Can talk to SQL databases | ✅ | ❌ |
| Can use database-specific functions | ✅ | ❌ |
| Can query Python dataframes and variables | ❌ | ✅ |
| Returns results as a dataframe | ✅ | ✅ |
| Can join data across sources | ❌ | ✅ |
| Works with extremely large datasets* | ✅ | ❌ |
| Speed on small — large datasets | 🏎️** | 🚀 |
| Results can be cached | ✅ | ❌ |
\* To be clear, we're talking upwards of millions of rows / tens of GB here.
\** We don't know how fast your DB is, but we hope it feels like a Formula 1 🙂
Can't remember available table names, column names, or column data types?
You can browse your database and dataframe schemas using the Schema browser, found in the left sidebar alongside your data connections.
From there, you can explore a connection's database schema, tables, and columns by selecting the > button.
You can also click the Browse button on any SQL cell to jump to the current data source in the schema browser.
The initial load time for full schema details can take a few seconds. We cache schema results so that future references to the database details are immediately available.
The contents of the Schema browser are generated upon creation of a data connection, and will be cached until the Schema browser is manually refreshed. To refresh, click the icon next to the data connection name in the Schema browser and select Refresh.