A Good Day to Delete Hard

Deletion method is an important data warehousing concept, one you’d ideally master before an interview. It is not a matter of memorizing the definitions of the two main methods (which I copy from the internet for you below), but rather understanding the consequences of choosing one over the other.

What are the types of deletes?

The deletion methods describe how a database (or a specific table) changes when a record is set to be deleted.

Hard Deletes - The entire record is deleted from the table. After the delete, users will not be able to see that the record ever existed. Soft Deletes - The record is not removed from the table, but a field on the table indicates that it is deleted. The field can be a boolean such as is_deleted or a timestamp such as deleted_timestamp. It’s usually best if this field name is consistent across all tables so users can readily and dynamically find the non-deleted data set. (Source)

When should we use hard deletes?

Soft deletes are not a good idea in an analytic data warehouse. Stakeholders expect to see a table with only relevant records; for analytical data warehouse users, cats_dim means “a list of cats who are currently active.” See the table below as a counter example:

cat_id cat_name is_deleted
1 Mac N
2 Cheese N
3 Sushi Y

An analyst (who likely hasn’t read all the field names, because there’s probably more than just the three provided) may quickly try to find the count of current cats with select count(*) from cats_dim

This would give her three cats, instead of the real answer of two (of course we couldn’t adopt Sushi, although he was cute. Our hearts are already full).

This isn’t the worst mistake in the world, but imagine instead of count(*) it was sum(revenue). Now imagine that metric was reported to an executive. You see where I’m going with this.

So by convention, in a data warehouse, we do hard deletes. But that doesn’t mean we do them everywhere.

When should we use soft deletes?

We know that hard deletes are helpful when users are expecting only current data, but there are other types of users out there. For example, the team that writes the ETL from source databases - us!

As part of the ETL, we retrieve datasets from a source database either as a full extraction or incremental extraction. In a full extraction, we pull the entire dataset every time we need more data (the dataset is usually a table, but it can be a report at an API endpoint). This is the most computationally expensive way to get data, but the easiest to code.

In the case of an incremental extraction, we only query for the data we have not seen yet from the source. We may write this query using a metadata field that represents the last time that record was modified in the source database (I like to call this field last_modified_timestamp, but often you’ll see it called systimestamp or sysmodtimestamp by convention). In our extraction query, we effectively say “yesterday I got data up until 2019-01-01, please give me the data that has been updated since then.” This allows us to extract and load fewer rows, saving compute and possibly storage.

So what happens if a source database does hard deletes? Let's say we extract data a table called cat_hospitalization_fact in the source.

cat_id cat_name hospitalization_date hospitalization_reason last_modified_timestamp
1 Mac 2019-12-17 Ate thread from a sewing basket 2019-12-18
2 Cheese 2018-12-01 Ate hair elastics 2018-12-02
3 Sushi 2019-12-18 Neutering 2019-12-19

If we incrementally extracted from this table on 2019-12-19, we received all three of those rows. Now let’s say we revisit the table on 2019-12-20, when it looks like this:

cat_id cat_name hospitalization_date hospitalization_reason last_modified_timestamp
1 Mac 2019-12-17 Ate thread from a sewing basket 2019-12-18
2 Cheese 2018-12-01 Ate hair elastics 2018-12-02
5 Tuna 2019-12-19 Neutering 2019-12-20

When we extract, we will get one new row about Tuna. Our table in our database will look like this:

cat_id cat_name hospitalization_date hospitalization_reason last_modified_timestamp
1 Mac 2019-12-17 Ate thread from a sewing basket 2019-12-18
2 Cheese 2018-12-01 Ate hair elastics 2018-12-02
3 Sushi 2019-12-18 Neutering 2019-12-19
5 Tuna 2019-12-19 Neutering 2019-12-20

When compared to the source, this is wrong because it still has Sushi's hospitalization. For whatever reason, Sushi’s hospitalization was deleted. In order to have known this, we would have had to set the extraction date back to 2019-12-18, but, since we have no way of knowing which records will be deleted when, we’d actually have to set the extraction date back to the beginning of time to account for hard deletes. This would just be a full load.

When a source database has hard deletes, we have to do a full extraction to ensure we don’t misrepresent the data. This is expensive, especially as the data grows. If the source database had used soft deletes, it would look like this:

cat_id cat_name hospitalization_date hospitalization_reason is_deleted last_modified_timestamp
1 Mac 2019-12-17 Ate thread from a sewing basket N 2019-12-18
2 Cheese 2018-12-01 Ate hair elastics N 2018-12-02
3 Sushi 2019-12-18 Neutering Y 2019-12-20
5 Tuna 2019-12-19 Neutering N 2019-12-20

When we extracted the data for the second time (on 2019-12-20), we would have received:

cat_id cat_name hospitalization_date hospitalization_reason is_deleted last_modified_timestamp
3 Sushi 2019-12-18 Neutering Y 2019-12-20
5 Tuna 2019-12-19 Neutering N 2019-12-20

This would have indicated to us that Sushi’s record was deleted without having to pull the entire dataset. We can keep our incremental loads but not overstate data.

Looking at it from the perspective of the source database owners, another advantage of soft deletes is that they are easier to undelete.

So what do we do once we have this soft delete? It’s easy, we can use it in the merge statement (this example is Snowflake specific).

Our merge statement would look like this:

MERGE INTO cat_hospitalization_fact as t
USING source_data as s
ON t.cat_id = s.cat id
AND t.hospitalization_date = s.hospitalization_date
WHEN MATCHED THEN UPDATE SET
t.cat_name  = s.cat_name,
t.hospitalization_reason = s.hospitalization_reason,
WHEN NOT MATCHED THEN INSERT
(cat_id, cat_name, hospitalization_reason, hospitalization_date)
VALUES 
(s.cat_id, s.cat_name, s.hospitalization_reason, s.hospitalization_date)
WHEN MATCHED AND s.is_delete = ‘Y’ THEN DELETE

Notice the last line– we can use the soft delete from the source database to delete the record from our table. With hard deletes, we cannot do this; the absence of a row can’t create a delete in a merge, only a field indicating its deletion. In the case of hard deletes, we have to do a separate DELETE statement.

Conclusion

There is no clear winner between soft deletes and hard deletes; each has its own usage and advantages. That being said, and I can’t stress this enough, do not let your cats eat thread. Also, can I borrow some money? Cat surgery is expensive.