CTEs versus Subqueries

I get this question a lot: Is a subquery more performant than a CTE? (I will be testing in Snowflake). It’s a good question. First, let’s go over what each of these looks like when we query them. A Common Table Expression (aka CTE, aka WITH statement) is a temporary data set to be used as part of a query. It only exists during the execution of that query; it cannot be used in other queries even within the same session (from Wikipedia).

A subquery is a nested query; it’s a query within a query (more Wikipedia). This means it gives a result from a query as a temporary data set which can be used again within that query. Does that sound familiar?

The syntax is slightly different. Here is a query to find the latest date a cat was pet using a CTE:

WITH avg_pet_count_over_time AS (
  SELECT
  	cat_id, 
  	MAX(timestamp)::DATE AS max_pet_date,
    MIN(timestamp)::DATE AS min_pet_date 
  FROM cat_pet_fact
  GROUP BY cat_id
)
SELECT 
  cat_name, 
  max_pet_date 
FROM cat_dim
LEFT JOIN avg_pet_count_over_time
ON cat_dim.cat_id = avg_pet_count_over_time.cat_id;

Here is the same query using a subquery:

SELECT 
  cat_name, 
  max_pet_date 
FROM cat_dim
LEFT JOIN  (
  SELECT
  	cat_id, 
  	MAX(timestamp)::DATE AS max_pet_date,
    MIN(timestamp)::DATE AS min_pet_date 
  FROM cat_pet_fact
  GROUP BY cat_id
) AS avg_pet_count_over_time
ON cat_dim.cat_id = avg_pet_count_over_time.cat_id;

Notice how similar these two queries are. The only difference is where the avg_pet_count_over_time is located.

cat_dim is 33 rows, while cat_pet_fact is 1,531,905 rows. I ran both queries in Snowflake (I successfully avoided the cache by adding a current_timestamp to the query). They both scanned 1.3MB.

So there you have it– CTEs and subqueries are the exact same in terms of performance. Since in the CTE the query is on its own and not embedded within another FROM or JOIN statement, it can help logically separate parts of your query.

So it’s all just syntactic sugar? Not quite. What happens if we join to the avg_pet_count_over_time twice. Can the compiler tell that it’s the same data set?

Let’s say we want to find both the minimum and maximum date that a cat was pet (I know I didn’t need to do it like this but creating examples like this is hard). This will look like this:

WITH avg_pet_count_over_time AS 
(
  SELECT 
    cat_id, 
    MAX(timestamp)::DATE AS max_pet_date, 
    MIN(timestamp)::DATE AS min_pet_date 
  FROM cat_pet_fact
  GROUP BY 1
)
SELECT 
  cat_name,
  t1.max_pet_date,
  t2.min_pet_date
FROM cat_dim
LEFT JOIN avg_pet_count_over_time as t1
ON cat_dim.cat_id = t1.cat_id
LEFT JOIN avg_pet_count_over_time as t2
ON cat_dim.cat_id = t2.cat_id;
SELECT 
  cat_name,
  t1.max_pet_date,
  t2.min_pet_date 
FROM cat_dim
LEFT JOIN 
  (SELECT 
    cat_id, 
    MAX(timestamp)::DATE AS max_pet_date,
    MIN(timestamp)::DATE AS min_pet_date
  FROM cat_pet_fact
  GROUP BY 1) AS t1
ON cat_dim.cat_id = t1.cat_id
LEFT JOIN 
  (SELECT 
    cat_id,
    MAX(timestamp)::DATE AS max_pet_date,
    MIN(timestamp)::DATE AS min_pet_date
  FROM cat_pet_fact
  GROUP BY 1) as t2
ON cat_dim.cat_id = t2.cat_id;

Notice right away that the query with the CTE is easier to read and has less redundancy. When I ran this, the CTE statement scanned 1.3MB while the subquery one scanned 2.7MB, doubling the amount of bytes. Why is this? In the CTE query, the compiler knows you’re querying the same data set since it has saved it (albeit temporarily) as avg_pet_count_over_time. In the second query, even though the SQL is the exact same, the compiler does not realize they’re the same query until it runs them. Notice that we have to call the same query by the two distinct aliases: t1 and t2. Not only does this query take more compute and contain redundancy, it also forces us to call the same query two different names. This is misleading; life is much better when both the SQL compiler and your coworkers know when you’re using the same data set rather than creating a new one.

My general advice would be to only use subqueries in adhoc queries when you need results quickly. If the query is going to be read by others, run every day, or reused, try to use a CTE for readability and performance. Performance may not kick in until the CTE is used twice, but if the second JOIN has to be built in, your syntax will allow for that development more easily.

When I talked about this with my team, someone reminded me that I only tested this in Snowflake; other database compilers may handle subqueries differently and that may impact performance. Please keep in mind that I tested this only on Snowflake.