Advanced Cool Things in Snowflake

This is part 2 of my blog post Cool Things in Snowflake, as requested by one of my many fans. These tips are equally cool, but you may find yourself in situations where they're useful less often.

1. Don't write a CASE WHEN when a Boolean will do

Often I see fields being defined as:

CASE WHEN name IN ('Mac', 'Cheese') THEN True ELSE False END AS is_mac_and_cheese

This is totally valid, but it can be a bit verbose, especially when there is more than one condition:

CASE WHEN name IN ('Mac', 'Cheese') AND type = 'Cat' THEN True ELSE False END AS is_mac_and_cheese

OR

CASE WHEN (name IN ('Mac', 'Cheese') AND type = 'Cat') OR (is_mac = True OR is_cheese = True) THEN True ELSE False END AS is_mac_and_cheese

Clearly, things get out of control quickly.

What these statements have in common is that they are asking Snowflake to evaluate a Boolean and then map the Boolean result to the same Boolean value. Since Snowflake can already evaluate Booleans as fields, we don't need to tell the compiler “THEN True ELSE False”. It would be just as efficient but less verbose to write:

name IN ('Mac', 'Cheese') AS is_mac_and_cheese

OR

name IN ('Mac', 'Cheese') AND type = 'Cat' AS is_mac_and_cheese

OR

(name IN ('Mac', 'Cheese') AND type = 'Cat') OR (is_mac = True OR is_cheese = True) AS is_mac_and_cheese

Notice that the statements are easier to read. This can also help you avoid nested CASE WHEN statements, which can be hard to manage.
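For context, here's a minimal sketch of what this looks like inside a full query, assuming a cats table with name and type columns (like the one used later in this post):

-- Boolean expressions can be selected directly as columns; no CASE WHEN needed
SELECT
    name,
    name IN ('Mac', 'Cheese') AND type = 'Cat' AS is_mac_and_cheese
FROM cats;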

2. Hash to see if two tables are exactly equal

Snowflake has two hash functions: HASH and HASH_AGG.

These aren't hashes in the cryptographic sense, as the Snowflake reference pages warn, but rather ways to fingerprint rows so you can group similar values and compare exact ones.
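A quick sketch of the difference, using the cats table from the example below: HASH produces one value per row, while HASH_AGG collapses a set of rows into a single value.

-- one hash per row
SELECT HASH(*) FROM cats;

-- one hash for the entire table
SELECT HASH_AGG(*) FROM cats;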

I cloned my cats table to create cats2 so for now it's an exact copy. As a result, the following queries give the same result:

SELECT HASH_AGG(*) FROM cats

SELECT HASH_AGG(*) FROM cats2

This is evident when I run:

SELECT HASH_AGG(*) FROM cats
UNION
SELECT HASH_AGG(*) FROM cats2;


This returns a single row with the value -823639142845967473, because UNION removes duplicate results.

What does this value mean? To me, nothing. The only thing I use it for is to see if cats and cats2 are exactly equal, which I do by running:

SELECT HASH_AGG(*) = (SELECT HASH_AGG(*) FROM cats2)
FROM cats;

which returns True.

In this case, I got the result I wanted, but let's say I insert one additional row to cats. What happens?

SELECT HASH_AGG(*) FROM cats
UNION
SELECT HASH_AGG(*) FROM cats2;

Now returns two rows: -823639142845967473 and 342525326256346.

This tells me one thing: they're not equal. But it doesn't tell me how off they are. If you're comparing two tables (or columns) and you want to know whether they're exactly equal, HASH_AGG is a fast way to do it. But if you want to quantify the degree of difference, you will have to compare them using set functions, JOINs, or both.
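For example, a minimal sketch using MINUS (Snowflake's synonym for EXCEPT) to see which rows exist in one table but not the other:

-- rows in cats that are missing from cats2
SELECT * FROM cats
MINUS
SELECT * FROM cats2;

-- rows in cats2 that are missing from cats
SELECT * FROM cats2
MINUS
SELECT * FROM cats;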

3. Don't always respect your NULLs


Let's say we have this table of cat appointments at the vet. As we know, cats go to the vet for annual checkups but also when they eat things that they shouldn't. At annual checkups they get weighed, but during an emergency, the vet forgoes that step in order to get to the life-saving surgery faster.

cat_name  appointment_date  weight  visit_reason
Mac       2018-06-01        10 lbs  Annual
Cheese    2018-06-01        8 lbs   Annual
Mac       2019-06-01        11 lbs  Annual
Cheese    2019-06-01        9 lbs   Annual
Mac       2019-12-01        NULL    Emergency string surgery

If the weight were always populated, we might write something like this:

-- get the most recent row for each cat

SELECT cat_name, weight FROM 
(
    SELECT 
        cat_name, 
        weight,
        RANK() OVER (PARTITION BY cat_name ORDER BY appointment_date DESC) AS rank
    FROM cat_appointment_fact
) WHERE rank = 1;

However, this would give you:

cat_name  weight
Cheese    9 lbs
Mac       NULL

As you can see, Mac's most recent weight value is NULL, which is wrong because he is very much 11 pounds of annoying cat.

Instead of getting the most recent record, let's focus on getting the most recent non-NULL value:

SELECT DISTINCT
    cat_name, 
    LAST_VALUE(weight) IGNORE NULLS OVER (PARTITION BY cat_name ORDER BY appointment_date) AS most_recent_weight
FROM cat_appointment_fact;

Notice that we are explicitly finding the most recent weight value, instead of focusing on the most recent row. We are also using the optional “ignore NULLs” clause, as opposed to the default “respect NULLs”. This prevents NULLs from being counted as the most recent value, giving us:

cat_name  most_recent_weight
Cheese    9 lbs
Mac       11 lbs

Mac appreciates the accuracy.
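If you'd rather stick with the ranking approach from earlier, a rough sketch of an equivalent query is to simply drop the NULL weights before ranking:

-- get the most recent non-NULL weight per cat by filtering before ranking
SELECT cat_name, weight AS most_recent_weight FROM
(
    SELECT
        cat_name,
        weight,
        RANK() OVER (PARTITION BY cat_name ORDER BY appointment_date DESC) AS rank
    FROM cat_appointment_fact
    WHERE weight IS NOT NULL
) WHERE rank = 1;

This returns the same two rows as the table above.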

4. Carefully balance REPLACE, REGEX and TRY_TO_NUMBER

Let's say you get a number but it's not really a number. This happens often because user-entered data can take forms like '1,2', '100,200', or 'cats12', sometimes with stray '+' signs mixed in.

In order to use these as numbers (let's say, to perform a mathematical operation on them), we have to clean them up a bit.

My go-to is REPLACE. I think it's clear to read and write:

REPLACE(cat_currency, ',', '')

The problem is that REPLACE only handles one pattern at a time; in order to clean up all the characters listed above, you need to nest the function:

REPLACE(REPLACE(cat_currency, ',', ''), '+', '')

Parentheses. Oy. As a general rule, I only use REPLACE if there are one or two characters to clean up. After that, the nesting becomes unmanageable and the expression unreadable.

At this point, especially when your data is user entered and there is no limit to the different non-numeric characters you'll see, you may want to consider using REGEXP_REPLACE.

This function allows you to use a regex to strip out whole classes of characters at once. To keep only numeric values, I suggest '[^0-9_.]', which means: remove everything except digits, underscores, and periods:

SELECT REGEXP_REPLACE('1,2', '[^0-9_]') from dual; returns 12

SELECT REGEXP_REPLACE('100,200', '[^0-9_]') from dual; returns 100200

SELECT REGEXP_REPLACE('1.2', '[^0-9_.]') from dual; returns 1.2

SELECT REGEXP_REPLACE('cats12', '[^0-9_]') from dual; returns 12

The last function, TRY_TO_NUMBER, will convert strings to numbers, but will return NULL on failure:

SELECT TRY_TO_NUMBER('1.2'); gives 1 (the default scale is 0, so the decimal part is dropped), while

SELECT TRY_TO_NUMERIC('1.2', 3, 2); gives 1.2

However,

SELECT TRY_TO_NUMBER('cats12') gives NULL

SELECT TRY_TO_NUMBER('100,200'); gives NULL

SELECT TRY_TO_NUMBER('1,2'); gives NULL

Notice that '100,200', a value that a lot of us would read as a number, becomes NULL. Since TRY_TO_NUMBER doesn't throw an error, this can lead to a lot of data silently disappearing.
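If you're worried about values silently disappearing, one quick sanity check is to compare how many non-NULL values you started with against how many survive conversion. (The cat_payments table name here is hypothetical; cat_currency is the column from the REPLACE examples above.)

SELECT
    COUNT(cat_currency) AS raw_values,                      -- non-NULL inputs
    COUNT(TRY_TO_NUMBER(cat_currency)) AS converted_values  -- inputs that convert cleanly
FROM cat_payments;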

Use REPLACE when you know you only need to remove a few characters and that set will not grow over time; use REGEXP_REPLACE when you want to handle whatever characters might show up in the future; and use TRY_TO_NUMBER when you want to discard (turn to NULL) any value that cannot easily be converted to a number.
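These options also combine nicely. Here's a sketch that strips the junk characters first and then converts, again using the hypothetical cat_payments table:

-- clean first, then convert; anything still unparseable becomes NULL
SELECT TRY_TO_NUMBER(REGEXP_REPLACE(cat_currency, '[^0-9_.]'), 12, 2) AS clean_currency
FROM cat_payments;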

P.S. To use a similar REGEXP_REPLACE call to strip out everything except alphabetic characters, you would do:

SELECT REGEXP_REPLACE('cats12', '[^a-zA-Z]') FROM dual; which returns “cats”

5. Prefer WHERE to HAVING when either will do


There is nothing in the Snowflake compiler that will prevent you from doing something like:

SELECT * from cats
HAVING type = 'Cat'

The compiler may convert this to:

SELECT * from cats
WHERE type = 'Cat'

To be honest, the first query isn't going to tip the scales of your account usage, but as you write more and more complicated queries, you want to ensure that you only use HAVING when you need to. Why? Because HAVING is evaluated after the GROUP BY. It is meant to filter aggregated data, so it's more appropriate for something like:

SELECT name, count(*) from cats
GROUP BY name
HAVING count(*)>1

Note the aggregate there. A WHERE clause is applied to the rows themselves, the lowest grain available, and it runs before the grouping. So filtering in the WHERE clause lowers the number of records that need to be grouped and therefore saves compute. Filtering in the HAVING clause happens only after the data is aggregated. Since you need the data to be aggregated to evaluate something like count(*), sometimes you have no choice. But when you do have a choice, filter before you aggregate; it saves you time and money.
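Putting it together with the appointments table from earlier in this post, here's a sketch that uses each clause for what it's good at:

-- WHERE trims rows before the GROUP BY; HAVING filters the aggregated result after
SELECT
    cat_name,
    COUNT(*) AS annual_visits
FROM cat_appointment_fact
WHERE visit_reason = 'Annual'
GROUP BY cat_name
HAVING COUNT(*) > 1;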