Much Ado About NULL Things
We have to talk about NULLs.
This post won’t include anything revolutionary, but rather just a summary of what NULLs are, what they aren't, and why they are important in data warehousing. It’s based on my own experience, not anything academic.
What is NULL?
NULL means nothing, but NULL is not “Nothing.” NULL means that “a data value does not exist in the database” (Source). The value simply isn't there; the value is not 0, N/A, or False. N/A, for all its ambiguity, is a value. Resisting the temptation to create a default value for a field may be counter-intuitive, but using NULL correctly can benefit both the end user and you.
Why does using NULL help the end users?
Correct Aggregated Results
Let’s say you have a table that includes both dates and names, like the following:
Cat_id | Cat_name | Adopted_Date | Cat_color_id |
---|---|---|---|
1 | Mac | 2017-10-01 | 1 |
2 | Cheese | 2017-10-01 | 1 |
If we want to add a future cat but we don’t use NULL, we may put in an entry such as the following to communicate that I don’t yet have a cat:
Cat_id | Cat_name | Adopted_Date | Cat_color_id |
---|---|---|---|
1 | Mac | 2017-10-01 | 1 |
2 | Cheese | 2017-10-01 | 1 |
3 | No Name | 2099-12-31 | 3 |
If you use intuition, the results of a select *
will make it clear that I adopted two cats. However, this is a small table; in bigger tables we are unable to view all records at once. As a result, users often use aggregates such as:
SELECT COUNT(*) FROM cat_adoption_fact;
or
SELECT
MONTH(adopted_date) AS adopted_month,
COUNT(*) AS cat_count
FROM cat_adoption_fact
GROUP BY adopted_month
The former query gives 3 results, which may indicate to the stakeholders that I have three cats when, unfortunately, I do not. The latter query is more of an issue though. The results are:
adopted_month | cat_count |
---|---|
10 | 2 |
12 | 1 |
This table clearly tells the end user that I adopted two cats in October and one in December. I wish.
Easier Querying
Since a default is automatically different depending on the data type, using NULL is the only way to gain consistency across fields and data sets.
For instance, from the above cat_adoption_fact table, in order to correctly get the results expected from the above query, a user would have to write:
SELECT
MONTH(adopted_date) AS adopted_month,
SUM(CASE WHEN cat_name = 'No Name' then 0 ELSE 1 END) as count_cat
FROM cat_adoption_fact
WHERE adopted_date != ‘2099-12-31’
GROUP BY adopted_month
This would tell the user I only have two cats (sigh), both adopted in October, but notice how much more domain knowledge the user has to have. Not only does this require more functions (more about that below), but it also requires the user to know that the default value for cat_name is “No Name”, while the default value for adopted_date is 2099-12-31. These choices were not totally mine; since the adopted_date field has data type Date, it can only contain a Date value or a NULL. So we cannot fill both fields with “Not Applicable.” This would also be true of a Boolean, which can only be True, False or Null. “Unknown” or “Not Applicable” are not valid Boolean fields. Since a table (not to mention a data mart) are likely to have multiple data types, the onus will always be on the user to know the value of the default for each type. Additionally, since these default values look the same to the database as any other value, there is nothing ensuring that the default value is consistent across fields with the same data type. So we may see the following in the data warehouse, in addition to the cat_adoption_fact table we may have a cat_color_dim table:
Cat_color_id | Cat_color_name | |
---|---|---|
1 | Black | |
2 | Torty | |
3 | Unknown |
The end user, who may now feel that they have the hang of “No Name”, will now have to manually inspect this table in order to write this query in order to find the count of cats for each color:
SELECT
cat_color_name,
SUM(CASE WHEN cat_name = ‘No Name' then 0 ELSE 1 END) as count_cat
FROM cat_adoption_fact
WHERE cat_color_name != ‘Unknown’
GROUP BY cat_color_name
Notice that, within this one query, the user has to know two distinct default values. This is asking a lot of the user and setting them up for potential incorrect results.
Cleaner Code
Database management systems have built in functions and default behavior that only apply to NULL. Leveraging out of the box functions keep your code cleaner for easier understanding and incrementing.
I will be focusing on Snowflake to keep the explanations concise, but know that another RDMS may have an equivalent function.
Default Behavior
Assume our table was contained NULLs, like this:
Cat_id | Cat_name | Adopted_Date | Cat_color_id |
---|---|---|---|
1 | Mac | 2017-10-01 | 1 |
2 | Cheese | 2017-10-01 | 1 |
3 | NULL | NULL | NULL |
If we rerun our first two queries, they get different results.
SELECT COUNT(*) FROM cat_adoption_fact;
By default, Snowflake will not count any NULL values (source), so this will return 2. I only have 2 cats. Stop rubbing it in.
SELECT
MONTH(adopted_date) AS adopted_month,
COUNT(*) AS cat_count
FROM cat_adoption_fact
GROUP BY adopted_month
This query results actually will include NULLs, but as its own line:
adopted_month | cat_count |
---|---|
10 | 2 |
NULL | 1 |
This makes it clear to the user that the third cat has not been adopted. The following functions also only aggregate non-NULL values:
- AVG
- STTDEV
- SUM
Using 0, -1, or any other numeric value to indicate NULL will impact these functions.
Note: You can combine the default behavior of COUNT with the grouping to create the following query:
SELECT
MONTH(adopted_date) AS adopted_month,
COUNT(cat_name) AS cat_count
FROM cat_adoption_fact
GROUP BY adopted_month
This would result in the following:
adopted_month | cat_count |
---|---|
10 | 2 |
NULL | 0 |
Some may prefer that result set to indicate that there is no third cat.
Built in Functions
Snowflake’s built in functions that handle NULLs are:
Without these built in functions, a query as simple as:
SELECT
COALESCE(cat_nickname, cat_name) AS cat_name
FROM cat_dim;
May have to be written as:
SELECT
CASE WHEN cat_nickname = 'No Nickname' THEN cat_name ELSE cat_nickname END AS cat_name
FROM cat_dim;
Notice how less concise the second query is despite the fact that this is simple logic. In more complicated queries, functions such as COALESCE can help readability and therefore productivity.
But what about ORDER BY?
An argument for default values such as 2099-12-31 is that this would allow adopted cats to always precede not-yet-adopted cats in an order by. However, by default, NULLs come last in an ordering of dates anyways, but if you wanted the default date to come first in this case you could do:
SELECT
*
FROM cat_adoption_fact
ORDER BY adopted_date DESC
This is actually easy to replicate with NULLs and the query is more explicit:
SELECT
*
FROM cat_adoption_fact
ORDER BY adopted_date NULLS FIRST
Using NULL actually gives you more control over ORDER BY. Window functions even have a feature to IGNORE NULLs (See an example here).
But are there gotchas?
Yes. NULL does not equal NULL. So if you try to join two tables, if both values in your join are NULL, those two record will not be considered a match. However, the workaround to this is the built in EQUAL_NULL function in Snowflake.
TL;DR
NULL has its own meaning; it cannot be replaced with a default value. It may feel counterintuitive, but using NULL will actually create cleaner, more explicit code and data sets that are easier for end users to consume.