
SQL cells

Query your warehouse, uploaded files, or dataframes directly with SQL.

info

Hex has first-class SQL support via SQL cells, each of which is a fully fledged query IDE complete with autocompletion, caching, a Data browser, and more. Projects can have an unlimited number of SQL cells interleaved with Text, Code, Chart, or Input cells, creating an extremely flexible querying environment.

SQL Cell Principles

Hex's cell-based structure, along with a couple of other Hex-specific features, makes for a more powerful but slightly different querying workflow than you may be used to. There are five key principles to understand about writing SQL in Hex:

  1. There are two kinds of SQL queries in Hex: Queries can run against a remote database or warehouse (data connection SQL), or against a local dataframe or .csv (dataframe SQL). Together, these two options let you execute some very powerful workflows.
  2. Data connection SQL queries can return results as dataframes or queries: By default, SQL queries return their entire result set as a dataframe that can be operated on in memory. When working with larger datasets, it's more efficient to put a SQL cell into "query mode", which instead returns a query object: a lightweight "pointer" to the complete result set, with a 100 row preview.
  3. SQL queries can be chained together: Any SQL query can refer to previous SQL queries in a project to form "chains" of queries, much like a complex query might include many CTEs. You can use this chained SQL technique to split logic up atomically and make complex queries more readable.
  4. SQL queries can be parameterized with project logic: Values from other parts of project logic can be inserted into SQL queries using Jinja syntax. This lets you parameterize queries based on user input, the output of an earlier query, or Python code to build complex user-facing data applications powered by SQL (see the sketch after this list).
  5. SQL queries are reactive to updates and user input: Queries will reactively re-run whenever their inputs or dependencies change, making for fast iteration and immediate feedback loops. Project state will always be up to date, even across many SQL, Python, or other cells.
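
For example, here's a minimal sketch of principle #4. The table and parameter names are hypothetical; min_signup_date stands in for an input parameter defined elsewhere in the project (e.g. a date Input cell or a Python variable):

-- min_signup_date is a hypothetical parameter from elsewhere in the project;
-- Hex substitutes its current value via Jinja each time the query runs.
SELECT user_id, created_at
FROM users
WHERE created_at >= {{ min_signup_date }}

And because queries are reactive (principle #5), changing that input value re-runs this query automatically.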

When should I use data connection SQL vs dataframe SQL?

This decision is often made for you by the location of data. If data is in a database or warehouse, you must use data connection SQL. If it's in a dataframe or a .csv, you must use dataframe SQL.

However, you can build clever and complex things by combining these two in more fluid ways, from "Chained SQL" workflows that replace subqueries to powerful cross-database joins. Some examples of when you might reach for dataframe SQL over data connection SQL:

  • If you need to run an intermediary operation in Python. Say you run a Snowflake SQL query, and then need to geocode a column using Python. Dataframe SQL will let you write SQL against the output of that intermediary geocoding operation, even though that data is no longer in Snowflake.
  • If you need to query data from a .csv file. Dataframe SQL can directly run queries against uploaded files, with just a SELECT * FROM "file_name.csv".
  • If you need to join data across different data sources. You can run two data connection SQL queries, one against a Snowflake connection and one against a Redshift DB, and then join the two dataframe outputs together using a dataframe SQL query (see the sketch after this list).
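
For instance, here's a minimal sketch of that cross-database join, assuming the two earlier queries returned dataframes named snowflake_orders and redshift_users (hypothetical names):

-- snowflake_orders and redshift_users are hypothetical dataframes returned by
-- two earlier data connection SQL cells; DuckDB joins them in-memory.
SELECT u.user_id, u.email, o.order_total
FROM redshift_users u
JOIN snowflake_orders o ON o.user_id = u.user_id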

The biggest reason not to use dataframe SQL, if you're in a situation where you can write data connection SQL instead? Dataframe SQL queries must be written in DuckDB's PostgreSQL-like flavor of SQL, which doesn't support some database-specific functions, or might just feel unfamiliar to you. Also, dataframe SQL requires the dataset to be loaded in memory, which precludes extremely large datasets from being queried this way.

In general, we hope that it Just Works, but this matrix may be helpful when trying to decide which mode to use:

Target use                                | Connection SQL | Dataframe SQL
Can talk to SQL databases                 | ✅             | ❌
Can use database-specific functions       | ✅             | ❌
Can query Python dataframes and variables | ❌             | ✅
Returns results as a dataframe            | ✅             | ✅
Can join data across sources              | ❌             | ✅
Works with extremely large datasets*      | ✅             | ❌
Speed on small-to-large datasets          | 🏎️**           | 🚀
Results can be cached                     | ✅             | ❌

* To be clear, we're talking upwards of tens of millions of rows / tens of GB here.

** We don't know how fast your DB is, but we hope it feels like a Formula 1 🙂.

When should I use query mode vs. dataframe mode?

Dataframe mode (the default) should work great for most queries on small-to-medium-sized datasets. Query mode should generally only be used when working with very large datasets.

Query mode queries are generally used as intermediary steps in an aggregation pipeline. You can think of this workflow like a funnel: you are starting with an extremely large dataset, and progressively making it smaller and smaller as you understand more about it (via subsequent chained SQL queries), until it's small enough to be returned in-memory with dataframe mode and visualized.
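
For instance, here's a minimal sketch of that funnel, with hypothetical table and cell names:

-- Cell 1 (query mode), named big_events: the full result set stays in the warehouse
SELECT * FROM events WHERE event_date >= '2023-01-01'

-- Cell 2 (dataframe mode), chained off big_events: aggregates the data down to
-- something small enough to return in-memory and visualize
SELECT event_type, COUNT(*) AS event_count
FROM big_events
GROUP BY 1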

Dataframe SQL queries always return dataframes.

When should I chain SQL?

Hex supports a workflow called chained SQL, where a SQL query references another SQL query from the project in the FROM clause. This functionality is slightly different across data connection and dataframe SQL, but the end result is the same: you can break large SQL queries up into individual cells that are clean and easy to debug. There are many examples and details in the reference section, but the two main use cases of chained SQL are easy to understand: simplifying complex queries, and letting SQL users 'drill down' into data.

Chained SQL can be used to break complex queries into atomic units that are easy to read, understand, and debug. A query with three CTEs can become four chained SQL cells, each able to be run, edited, cached, and debugged separately.

Chained SQL also lets users explore and iterate without rewriting code. Run a query with an interesting output that you'd like to drill deeper into? Traditionally, you'd copy that entire query and change it to fit your new format. If you had to aggregate at different levels, you'd begin to write a CTE (see point #1). With Chained SQL, you simply add a new SQL cell and write a fresh query against the interesting output.

This is available for dataframe SQL and data connection SQL, but is evaluated somewhat differently: dataframe SQL is actually running against the materialized output of previous queries, while chained data connection queries dynamically compile to CTEs in a final query executed on the remote connection. In practice, this doesn't have too much of an impact on user experience outside of the fact that staying in data connection SQL means you can keep using warehouse-specific SQL syntax features. Read more in the reference section of this doc.

Query reactivity and caching

Hex continuously and reactively re-executes cells when their dependencies and inputs change. This makes intuitive sense for Python code in the context of a notebook, but can be an adjustment for SQL users who are used to having to click "Run". Here's what you really need to know:

  • This keeps the state of the notebook up-to-date. You get fast feedback and cells are never out of sync with each other.
  • Dataframe SQL queries are "free" and fast. These queries run in-memory, in Hex, and never hit your warehouse. You can run a multi-million row dataframe SQL query ten billion trillion times and it won't cost you a cent (it will, however, bankrupt Hex). They are extremely fast, generally on the order of hundreds of milliseconds, so there's no need to worry about them reactively executing.
  • Data connection SQL queries can be cached. You can cache a query either "on initial app load", so it will execute against the connection once in each session, or "on scheduled run", so it will only execute against the connection when a schedule runs, e.g. twice a day or once a week. This can save both money and time when working with large SQL queries. More on caching here.
  • Only SQL queries that have had a dependency update will re-execute. If you run a massive SELECT * in the very first cell of your project, that query won't re-execute every time you change the input parameter halfway through. It's still a good idea to cache, but part of the magic is that Hex only ever runs necessary cells.
  • If you are working with huge, scary data, you can enable "Cell Only" run mode to prevent reactive re-execution of cells, or enable query mode for safety.

SQL Cell Reference

Data connection SQL

This is the SQL option you are probably most familiar with. Once you have configured a data connection to your warehouse or database, you can query it through a SQL cell. If your workspace has set up shared data connections, these connections will also be available to choose from.

First, add a new SQL cell and select a data source.

You can then write SQL directly in the cell, using the SQL dialect of the selected connection.

SQL cells within a project share the same database session, allowing for multi-statement queries and alter/update workflows.

caution

Running an uncached SQL cell will run that query against your database. To avoid accidentally running expensive queries every time you re-run a project, you can cache query results.

If you know you'll be running a lot of queries while developing, it can be smart to add a temporary LIMIT clause to potentially large queries.
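
For example (table name hypothetical):

-- temporary guardrail while developing; remove the LIMIT once the query is final
SELECT * FROM orders LIMIT 1000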

SQL cells have an integrated preview which shows up to 10 rows.

Query mode

Data connection SQL cells can return results in two different ways: dataframe mode or query mode. Dataframe mode, the default, is great for most use cases. Query mode should generally be enabled only when working with extremely large result sets.

In dataframe mode (the default), SQL cells return their entire result set in-memory as a pandas dataframe. If you run "SELECT names FROM fruit_table" in dataframe mode, there will now be an object taking up space in the project's memory that contains a few hundred fruits (banana, apple, pear, etc.). You can operate on that dataframe anywhere in Hex: reshape it in a Python cell, query it using dataframe SQL, visualize it in a Chart cell, etc.

In query mode, SQL cells return only a 100 row preview of the result set alongside a "query" object that serves as a pointer to the complete result set. The complete result set does not leave the warehouse or database, which makes this mode far superior to dataframe mode when working with large datasets. Waiting for a 100 million row query result to stream into Hex in dataframe mode could take minutes, or crash the notebook. If you're going to continue operating on and aggregating the data, you may as well leave it in the warehouse and just return a query that you can reference in subsequent chained SQL queries, pivot cells, and filter cells.

Dataframe SQL

Hex also lets you write dataframe SQL against any dataframe in the project, including the results of previous SQL Cells.

Dataframe SQL queries execute right in Hex using DuckDB, so they don't incur any usage costs on your data connections. The SQL syntax is very similar to PostgreSQL, and the full reference can be found in the DuckDB docs.

tip

Wondering whether to use dataframe SQL or not? Check out the SQL Principles section of this doc.

To use dataframe SQL, add a SQL cell and select Dataframes as the data source.

You can then reference any dataframe in the FROM or JOIN clauses, using the dataframe name where you would normally use the SQL table name.
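
For example, here's a minimal sketch assuming a hypothetical dataframe sales_df defined earlier in the project:

-- sales_df is a hypothetical dataframe from an earlier cell in the project
SELECT region, SUM(amount) AS total_sales
FROM sales_df
GROUP BY 1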

Dataframe indexes are not returned or accessible in a dataframe SQL query. This also applies to multi-index dataframes, and although queries will not error, you can lose important nested context if you aren't aware of this. If you need to preserve an index, we recommend calling reset_index() on the dataframe before querying it with SQL.

You can join across dataframes, but you cannot directly join across dataframes and database tables. If you want to join a dataframe and a db table, you'll need to first run a data connection SQL query to get that table as a dataframe in your project. Once everything is a dataframe, you can join the two together.

caution

Caching is not available for dataframe SQL. This is because dataframe SQL doesn't hit your database, so caching wouldn't protect you from any accidental spend.

It's also because dataframe SQL runs in-memory and tends to be very quick :)

Chained SQL

Chained SQL is a common workflow in Hex that allows a SQL query to reference other SQL queries from the project in its FROM clause. This functionality is slightly different across data connection and dataframe SQL, but the end result is the same: you can break large SQL queries up into individual cells that are clean, efficient, and easy to debug.

tip

For advice on when to use chained SQL, see the SQL Principles section of this doc.

Chaining dataframe SQL

Dataframe SQL queries are often chained, and it's seamless to do so. Every SQL query can return a dataframe as its output, and because dataframe SQL (duh) can query dataframes, you can query the results of any previous query. Say you have a .csv file, "users.csv". You write a dataframe SQL query against it to check it out, and name this query users:

SELECT user_id, created_at, last_seen_at, monthly_payment
FROM "users.csv"
WHERE is_customer IS TRUE;

When you go to write another, more in-depth query, you don't have to write it against "users.csv" anymore. You can write a chained dataframe SQL query directly against the users dataframe:

SELECT DATE_TRUNC('month', last_seen_at) AS month, COUNT(*)
FROM users
GROUP BY 1

This is a simple example, but you can see how, in a complex project, this could become really useful. Especially if you dabble in Python; imagine you have Python code that generates cohorts from that users query:

cohorted_users = generate_cohorts(users)

Rather than be stuck in Python, you can jump right back into SQL using a dataframe SQL cell:

SELECT cohort_id, SUM(monthly_payment) AS cohort_value
FROM cohorted_users
GROUP BY 1

Chaining data connection SQL

tip

Chained SQL works great with Query mode, but SQL cells do not need to be in query mode to be chained using data connection SQL. Even if a SQL cell is returning results as a dataframe, it can be chained in a data connection SQL query.

In practice, chained SQL queries work almost exactly the same for data connection SQL and dataframe SQL. There are just three important differences.

  1. When chaining data connection SQL queries, all queries must be running against the same data connection. You can chain three queries against a Snowflake warehouse. You cannot chain two Snowflake queries with one Redshift query (to do that, you'd use dataframe SQL).

  2. Chained SQL for data connections uses the SQL syntax of the connection. For many use cases, this is the primary differentiator from dataframe SQL queries, which have to be written in DuckDB's PostgreSQL-like flavor of SQL.

  3. When chained, data connection SQL writes a query with CTEs. Imagine a Hex project with two SQL cells, both querying a Snowflake warehouse. The first SQL cell, called customer_order_facts, is defined as:

SELECT customer_id, MAX(order_total) AS most_expensive_order, SUM(order_total) AS lifetime_order_total, COUNT(*) AS count_orders
FROM orders
GROUP BY customer_id

Later in the project, another SQL cell called order_total_distribution looks like this:

SELECT lifetime_order_total, COUNT(*)
FROM customer_order_facts
GROUP BY 1

order_total_distribution is now a Chained SQL query. It is written against a Snowflake data connection, but it is not referencing a table in the Snowflake warehouse; it's referencing another query from the same Hex project. When run, Hex will compile this query into:

WITH customer_order_facts AS (
SELECT customer_id, MAX(order_total) AS most_expensive_order, SUM(order_total) AS lifetime_order_total, COUNT(*) AS count_orders
FROM orders
GROUP BY customer_id
)
SELECT lifetime_order_total, COUNT(*)
FROM customer_order_facts
GROUP BY 1

You can see these compiled queries by clicking "View compiled" in a SQL cell.

Troubleshooting Chained SQL

At times, Hex may not be able to compile CTE references correctly. To check if this is happening, view the compiled SQL via the view compiled icon.

Often, this is the result of a CTE sharing a name with a reserved SQL keyword, which leaves the SQL parser unable to substitute the referenced query in for the CTE name.

To work around this, rename any CTEs whose names are reserved keywords.

Query datatypes

When query results are returned in-memory, they are stored as pandas DataFrames, which requires that some database-specific column types be transformed into pandas-supported datatypes. We try to keep the mapping from database column types to pandas types as consistent as possible across databases.

Generally, the following mappings are applied:

database column type   | pandas column type
int                    | int
int (with null values) | float
float                  | float
decimal                | float¹
string                 | object
date²                  | date
timestamp              | timestamp

¹ By default, decimals will be cast to floats. This behavior can be turned off; see the Infer numeric datatypes section below for more information.

² Returned as a date type only when the Use native Python dates setting (below) is enabled; when disabled, dates come back as timestamps.

Some notable differences between database column types and pandas data types:

  • Any int type column which contains nulls will be forced to a float type in the output dataframe. This is due to how pandas handles integer columns.
  • Depending on the database, decimal-type columns are returned with a varying default number of decimal places (PostgreSQL: 18, BigQuery: 9, Snowflake: 1).
  • Pandas stores datetimes as integers at nanosecond precision, and the number of nanoseconds required to represent far-future dates (beyond the year 2262) exceeds the maximum value of that integer type. If we detect such a large datetime, we convert all datetime columns in the dataframe to an object type to avoid causing a conversion error.

Infer numeric datatypes

Default: Enabled

When enabled, any column with a decimal type will be cast to a float. This is because Pandas does not support decimal datatypes. In general, casting a decimal to a float makes the number easier to work with downstream, and easier to format in our Table display cells.

However, there are times when casting a high-precision decimal to a float will result in loss of precision. In this case, you can disable the Infer numeric datatypes option in the SQL cell configuration options. When this setting is turned off, decimals will instead be represented as an object type.

Use native Python dates

Default: Enabled

When enabled, date columns in a SQL result will be represented as a Date type, with no time or timezone component, e.g. 2022-09-12. This allows downstream cells, such as Pivot and Chart cells, to handle these values in the expected way. This differs from the standard pandas dataframe behavior, which represents date columns as Timestamps; there may be a small performance tradeoff when this conversion occurs.

When disabled, date columns in a SQL result will be represented as a Timestamp type, with both a time and a timezone component (midnight and UTC, respectively), e.g. 2022-09-12 00:00:00+00:00. This is in line with the standard pandas dataframe behavior. Since these columns are timestamps, downstream Pivot and Chart cells will perform timezone conversion on them based on the hex_timezone value. This often results in data appearing as the "wrong" day, since the conversion may land a timestamp on a different calendar day in the target timezone (for example, midnight UTC on a Tuesday is still Monday in North and South America).

caution

Disabling this setting can result in unexpected results in downstream Chart and Pivot cells. Only disable this setting when absolutely necessary.

Query metadata

When executing SQL cells, we inject a comment into the queries with metadata on the source of that query: the issuing user's email, a link to the project and cell where the query originated, and the relevant trace ID. This allows database administrators to track down and audit queries originating from Hex. To view query metadata, click on the view compiled icon in the top right corner of the cell.
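
As a purely illustrative sketch (the exact comment format Hex uses may differ), the injected metadata resembles a SQL comment attached to your query:

-- Hypothetical illustration only; Hex's actual comment format may differ.
-- user: <user-email> | project/cell: <link-to-project-and-cell> | trace: <trace-id>
SELECT * FROM orders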

Multi-statement queries

You can write multiple statements in a single SQL cell, separating statements with a semicolon (;).

warning

Note that Athena, Databricks, ClickHouse, & Trino do not support multi-statement queries.

Below is a PostgreSQL example that:

  1. Drops a table if it already exists
  2. Creates a new table, defining the column names and types.
  3. Inserts a row of data into the table
  4. Returns the contents of the table.
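
A minimal sketch of such an example (table and column names are hypothetical):

-- 1. Drop the table if it already exists
DROP TABLE IF EXISTS fruit_inventory;

-- 2. Create a new table, defining column names and types
CREATE TABLE fruit_inventory (
    fruit_name varchar(50),
    quantity   integer
);

-- 3. Insert a row of data
INSERT INTO fruit_inventory (fruit_name, quantity)
VALUES ('banana', 42);

-- 4. Return the contents of the table
SELECT * FROM fruit_inventory;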

All databases, except for SQL Server, will return the results of the last statement. SQL Server returns the results of the first statement.

Some databases have limitations on how statements that write back to the database are supported.

  • Athena: See their docs to learn about how to use write back statements.

Killing SQL queries

To kill a query, click on the loading animation to the upper right of the running cell. This will send a request to the database to stop the query.

Clicking Stop/Interrupt from the dropdown of a project's run options will also kill any currently-running queries. If you hover over the load animation of a SQL cell and the tooltip reads "This cell has been queued for execution", this indicates that the cell is still queued to run and the query has not yet been sent to the database; stopping or killing the project's kernel while the cell is still queued will prevent the query from ever being sent to the database.