Much Ado About NULL Things

We have to talk about NULLs.

This post won’t include anything revolutionary, but rather just a summary of what NULLs are, what they aren't, and why they are important in data warehousing. It’s based on my own experience, not anything academic.

What is NULL?

NULL means nothing, but NULL is not “Nothing.” NULL means that “a data value does not exist in the database” (Source). The value simply isn't there; the value is not 0, N/A, or False. N/A, for all its ambiguity, is a value. Resisting the temptation to create a default value for a field may be counter-intuitive, but using NULL correctly can benefit both the end user and you.

Why does using NULL help the end users?

Correct Aggregated Results

Let’s say you have a table that includes both dates and names, like the following:

Cat_id Cat_name Adopted_Date Cat_color_id
1 Mac 2017-10-01 1
2 Cheese 2017-10-01 1

If we want to add a future cat but we don’t use NULL, we may put in an entry such as the following to communicate that I don’t yet have a cat:

Cat_id Cat_name Adopted_Date Cat_color_id
1 Mac 2017-10-01 1
2 Cheese 2017-10-01 1
3 No Name 2099-12-31 3

If you use intuition, the results of a select * will make it clear that I adopted two cats. However, this is a small table; in bigger tables we are unable to view all records at once. As a result, users often use aggregates such as:

SELECT COUNT(*) FROM cat_adoption_fact;

or

SELECT 
	MONTH(adopted_date) AS adopted_month, 
	COUNT(*)  AS cat_count
FROM cat_adoption_fact
GROUP BY adopted_month

The former query gives 3 results, which may indicate to the stakeholders that I have three cats when, unfortunately, I do not. The latter query is more of an issue though. The results are:

adopted_month cat_count
10 2
12 1

This table clearly tells the end user that I adopted two cats in October and one in December. I wish.

Easier Querying

Since a default is automatically different depending on the data type, using NULL is the only way to gain consistency across fields and data sets.

For instance, from the above cat_adoption_fact table, in order to correctly get the results expected from the above query, a user would have to write:

SELECT 
	MONTH(adopted_date) AS adopted_month, 
	SUM(CASE WHEN cat_name = 'No Name' then 0 ELSE 1 END) as count_cat
FROM cat_adoption_fact
WHERE adopted_date != ‘2099-12-31’
GROUP BY adopted_month

This would tell the user I only have two cats (sigh), both adopted in October, but notice how much more domain knowledge the user has to have. Not only does this require more functions (more about that below), but it also requires the user to know that the default value for cat_name is “No Name”, while the default value for adopted_date is 2099-12-31. These choices were not totally mine; since the adopted_date field has data type Date, it can only contain a Date value or a NULL. So we cannot fill both fields with “Not Applicable.” This would also be true of a Boolean, which can only be True, False or Null. “Unknown” or “Not Applicable” are not valid Boolean fields. Since a table (not to mention a data mart) are likely to have multiple data types, the onus will always be on the user to know the value of the default for each type. Additionally, since these default values look the same to the database as any other value, there is nothing ensuring that the default value is consistent across fields with the same data type. So we may see the following in the data warehouse, in addition to the cat_adoption_fact table we may have a cat_color_dim table:

Cat_color_id Cat_color_name
1 Black
2 Torty
3 Unknown

The end user, who may now feel that they have the hang of “No Name”, will now have to manually inspect this table in order to write this query in order to find the count of cats for each color:

SELECT 
	cat_color_name,
	SUM(CASE WHEN cat_name = ‘No Name' then 0 ELSE 1 END) as count_cat
FROM cat_adoption_fact
WHERE cat_color_name != ‘Unknown’
GROUP BY cat_color_name

Notice that, within this one query, the user has to know two distinct default values. This is asking a lot of the user and setting them up for potential incorrect results.

Cleaner Code

Database management systems have built in functions and default behavior that only apply to NULL. Leveraging out of the box functions keep your code cleaner for easier understanding and incrementing.

I will be focusing on Snowflake to keep the explanations concise, but know that another RDMS may have an equivalent function.

Default Behavior

Assume our table was contained NULLs, like this:

Cat_id Cat_name Adopted_Date Cat_color_id
1 Mac 2017-10-01 1
2 Cheese 2017-10-01 1
3 NULL NULL NULL

If we rerun our first two queries, they get different results.

SELECT COUNT(*) FROM cat_adoption_fact;

By default, Snowflake will not count any NULL values (source), so this will return 2. I only have 2 cats. Stop rubbing it in.

SELECT 
	MONTH(adopted_date) AS adopted_month, 
	COUNT(*)  AS cat_count
FROM cat_adoption_fact
GROUP BY adopted_month

This query results actually will include NULLs, but as its own line:

adopted_month cat_count
10 2
NULL 1

This makes it clear to the user that the third cat has not been adopted. The following functions also only aggregate non-NULL values:

Using 0, -1, or any other numeric value to indicate NULL will impact these functions.

Note: You can combine the default behavior of COUNT with the grouping to create the following query:

SELECT 
	MONTH(adopted_date) AS adopted_month, 
	COUNT(cat_name)  AS cat_count
FROM cat_adoption_fact
GROUP BY adopted_month

This would result in the following:

adopted_month cat_count
10 2
NULL 0

Some may prefer that result set to indicate that there is no third cat.

Built in Functions

Snowflake’s built in functions that handle NULLs are:

Without these built in functions, a query as simple as:

SELECT 
	COALESCE(cat_nickname, cat_name) AS cat_name
FROM cat_dim;

May have to be written as:

SELECT 
	CASE WHEN cat_nickname = 'No Nickname'  THEN  cat_name ELSE cat_nickname END AS cat_name
FROM cat_dim;

Notice how less concise the second query is despite the fact that this is simple logic. In more complicated queries, functions such as COALESCE can help readability and therefore productivity.

But what about ORDER BY?

An argument for default values such as 2099-12-31 is that this would allow adopted cats to always precede not-yet-adopted cats in an order by. However, by default, NULLs come last in an ordering of dates anyways, but if you wanted the default date to come first in this case you could do:

SELECT 
*
FROM cat_adoption_fact
ORDER BY adopted_date DESC

This is actually easy to replicate with NULLs and the query is more explicit:

SELECT 
*
FROM cat_adoption_fact
ORDER BY adopted_date NULLS FIRST

Using NULL actually gives you more control over ORDER BY. Window functions even have a feature to IGNORE NULLs (See an example here).

But are there gotchas?

Yes. NULL does not equal NULL. So if you try to join two tables, if both values in your join are NULL, those two record will not be considered a match. However, the workaround to this is the built in EQUAL_NULL function in Snowflake.

TL;DR

NULL has its own meaning; it cannot be replaced with a default value. It may feel counterintuitive, but using NULL will actually create cleaner, more explicit code and data sets that are easier for end users to consume.