ELI5 - Primary Keys

When I get requests to ingest or extract data as tables or reports, I often ask the requestor for the primary key. Sometimes they say the primary key doesn’t exist, or they don’t know what it is, or they don’t know why it’s important. Primary keys are a fundamental part of a data warehouse, but often people regard them as a technicality- something to tack onto the end of a table creation. In reality, they’re the very building block of tables and reports. When you really dive into it, primary keys are a concept we understood as children but later gave a fancy name to. And thus, I can explain the concept to Rhysie. As always, notes (in parenthesis) are the technical terms or concepts that Rhys doesn’t need to know, but you may want to know. Let’s begin.

As defined in ELI5 - Slowly Changing Dimension, a dimension is a thing. (Lovely definition Alisa, I still don’t know why you didn’t get into Harvard.) But a simple list of things would be boring, such as your list of your books:

Alice’s Adventures in Wonderland Robin Hood Curious George Does Some Bad Stuff but Guy in Hat Forgives Him

That list is good for answering “how many books do you have?” and “which books do you have?”, but it doesn’t tell us anything else about the books. We can only count and list them. Boring. So let’s add information about our books:

Book Name Author Description
Alice’s Adventures in Wonderland Carroll Very good book
Robin Hood Unknown Okay book
Curious George Does Some Bad Stuff but Guy in Hat Forgives Him Rey I don’t think we should teach children that primates are pets

No matter how many more things (attributes) we add here, we’re still describing each book. The table is about books. Each line (record) has a book, and each book only has one line. Some define the primary key as the thing that is different on each line (the primary key is what makes each line unique), but it has a more powerful meaning: this table is about books. The primary key is the book, in this case referred to be its name.

There is another kind of table that I didn’t tell you about: a fact table. Not that kind of a fact! We are using the term fact here in a different way. A fact is something that happened, it has a time associated with it. Let’s say you wanted to record all the times you’ve read a book, like this:

Date Pages Read
2018-08-12 20
2018-08-13 5
2018-08-14 15

This would be a fact table, since it’s saying what happened and when. A tell-tale sign of a fact table is that we have the date in there, but of course the exception is SCD.

So every day we read, and we write it down. Perfect. But what are we really describing here? We are describing what you did each day. Each day gets a new line. So your primary key is the date!

But let’s say your teacher wants to know exactly what you are reading, so we start keeping track of the books you’re reading too:

Date Pages Read Book Name
2018-08-12 10 Alice’s Adventures in Wonderland
2018-08-12 10 Robin Hood
2018-08-13 5 Alice’s Adventures in Wonderland
2018-08-14 15 Alice’s Adventures in Wonderland

But wait, we now have two lines for the 12th of August because we are no longer describing what happens each day. We are now describing what happens each day to each book, meaning the primary key is now the date and the book name. (To be clear, the primary key is driven by the definition of the event, not the data that is in the table; the primary key would still be (date, book name) even if the table were empty.)

One final note on primary keys; they can tell us the relationship between our two lists. Every book in our reading list should exist in the book list, right? (referential integrity) So Book Name in the reading list is a foreign key to Book Name in the Book list; it’s also part of the primary key, which is totally accceptable.

See, primary keys aren't that hard.