Uncool Things in Snowflake

This is the third post in a series of Cool Things to do in Snowflake, but these are things that are really cool when you don’t do them; they’re Uncool Things to Do in Snowflake.

1. Make a date not a date

Snowflake has plenty of out-of-the-box date functions: dayname, datediff, and date_trunc to name just three. You can read more about the functions here. Using a built-in function on a native date object gives you a lot of control, but you give up all this power as soon as you turn a date to not a date. Let’s say you have a perfectly good date, such as 2017-07-31, which is my best estimate of Mac and Cheese’s birthdate.

Let’s say you want year and month; you may be tempted to do:

SELECT 
    LEFT(REPLACE('2017-07-31', '-'), 6) AS birth_year_month
FROM cat_birthdates

And you would get 201707. But now let’s say you want to add 60 days to that.

SELECT 
    LEFT(REPLACE('2017-07-31', '-'), 6) + 60 AS birth_year_month
FROM cat_birthdates

Gives you 201767, which is not a date.

While

SELECT 
    DATEADD('day', 60, LEFT(REPLACE('2017-07-31', '-'), 6)) AS birth_year_month
FROM cat_birthdates

gives you 1970-03-04 08:01:47.000. That is somehow worse.

So how would we handle this? Always keep a date a date

Whenever I have to list a month, I like to use the first of the month:

SELECT 
    DATE_TRUNC('month', '2017-07-31'::Date) AS birth_year_month 
FROM cat_birthdates

which gives:

2017-07-01

You can then add 60 days by doing:

SELECT 
    DATEADD(‘day’, 60, DATE_TRUNC('month', '2017-07-31'::Date)) AS birth_year_month
FROM cat_birthdates

which gives:

2017-08-30

Much more accurate than anything the other queries gave us.

2. Use WHERE NOT IN <subquery> on anything other than a primary key

I am a big fan of WHERE IN <subquery> because it looks so clean:

SELECT animal_name 
FROM animals
WHERE animal_id IN (SELECT animal_id FROM cats);

In this case, you could just do an inner join, but let’s say you didn’t want to do that because you’re doing a DELETE statement to get rid of all non-cats:

DELETE 
FROM animal_name	
WHERE animal_id NOT IN (SELECT animal_id FROM cats);

I think this is readable, but you may get an unexpected error if animal_id has any NULLS (it’s always NULLs). If animal_id is ever NULL, the statement will stop and only delete what records it’s seensince. This is not a bug and it’s not unique to Snowflake; it has to do with how all databases treat NULL. The question and answers here will show you the frustration this can cause. There is also a RDMS agnostic write up here.

My general rules: Use JOINS in a SELECT because you can In DELETEs or UPDATEs, use NOT IN when both sides of the NOT IN cannot be NULL (I only trust when they’re the primary keys because then I know they’ll be not NULL) If you are not sure if the fields can be NULL, use NOT EXISTS since it’s safer

DELETE 
FROM animal_name AS a
WHERE NOT EXISTS
    (SELECT 
        animal_id 
    FROM cats AS c
    WHERE c.animal_id =  a.animal_id)

Read more about Subquery Operators, including ANY and ALL in the Snowflake docs.

3. SELECT * when you don’t mean it

Snowflake is a columnar database; this means you pay (in compute and time) for each column you ask Snowflake to return to you. I wrote about database storage in this post.

I did a SELECT * from a table with 17 columns and a more specific SELECT id from the same table. The bytes scanned went from 142.5KB to 28.0KB, despite that it returned the same numbers of rows.

The bytes scanned increase isn’t linear by column because there is overhead to querying a table, but if you are only planning on using one column, only select that column.

This expense is magnified when we use SELECT * in CTEs or subqueries:

WITH cats_are_cute AS (
    SELECT * 
    FROM animals
    WHERE cat = True
    AND breed != ‘Sphynx’
),
uncute_animals AS (
    SELECT * 
    FROM animals
    WHERE dog = True
    OR bread = ‘Sphynx’
)
SELECT animal_name, ‘meow’ FROM cats_are_cute
UNION ALL
SELECT animal_name, ‘bark’ FROM uncute_animals;

A few things happened here: I selected all columns from animals twice and, excluding my filters, I only used one of them: animal_name. It would be more efficient to write:

WITH cats_are_cute AS (
    SELECT animal_name 
    FROM animals
    WHERE cat = True
    AND breed != ‘Sphynx’
),
uncute_animals AS (
    SELECT animal_name 
    FROM animals
    WHERE dog = True
    OR bread = ‘Sphynx’
)
SELECT animal_name, ‘meow’ FROM cats_are_cute
UNION ALL
SELECT animal_name, ‘bark’ FROM uncute_animals;

Whether or not you see the results of the SELECT *, the database does the work. If you are getting no benefit from it, you should not pay for it.

4. ORDER BY in any statement except for the one returning data

Similar to #3, if you write the following query:

WITH dates AS 
    (SELECT * 
    FROM dates
    ORDER BY dt DESC),
dogs AS 
    (SELECT * 
    FROM adoption_fact)
SELECT * 
FROM dogs
INNER JOIN dates
ON dates.dt = dogs.adopted_date

you will pay for the ORDER BY (you could check the Profile tab to see the SORT)

This is my image

But Snowflake only cares about an ORDER BY in the last SELECT, so this query won’t order your final results. You’ll pay for the time and compute of the ORDER BY, but you will get nothing back.

Note: depending on your query, there may be some ordering that gets to the final SELECT, but this isn’t guaranteed, so it’s useless in most cases.

5. Manually comment out SQL

When using the Snowflake website, you can bulk comment by using COMMAND + /. This then makes it easier to bulk uncomment out. If you manually put in – or //, especially if you mix and match them, you have to then go through and manually uncomment each line. You also prevent someone else from bulk uncommenting if you use –.