There are two types of relational database management systems: columnar and row-based, also called: column-wise and row-wise, column store and row store, column-based and row-based, and column-oriented and row-oriented.
As always, I will be explaining this concept to my nephew Rhys (pronounced Reese, thanks) who is turning six in July. I will be explaining the two types to Rhys via a long analogy about his toys.
Okay Rhys, first we want to show the benefits of a row-based database. Since the difference between row and column-based databases mostly pertains to their method of storage, we are going to think through two different ways to store your toys: by game or toy type. If you store by game, that’s row-based; by toy, that’s column-based.
Let’s start with storing by game. You could store by game if every time you play with a set of toys, you put the combination of toys in a toy box when you’re done. If you played a game for which you need three stuffed dinosaurs, two Barbies, and some playdough, you store those all in the same box when you’re done and retrieve them all from the same box each time you want to play again. This makes setting up for a game much easier. You may even number the boxes (blocks) so you can remember which one is which (row id). In this analogy, we have to assume infinite toys, which honestly isn’t too big of an imaginary stretch. So now you have loads of similar sized boxes with different toys in them.
Similar to how this will make it easy for you to store and retrieve individual games, row-based databases are optimized for inserting and retrieving individual records, because the entire row is stored together. A good use case for this would be when you fill out your order information on a site; all of that data is stored all together at the same time. When your order summary email is sent, this is a retrieval of your record of data, which hopefully is sent quickly. In addition to speed though, we have to consider storage.
If your mom bought you boxes of a similar size, some may have a lot of empty space, while others toys may not fit so we will have to use a second box for them. This is a waste of space and hard to keep track of; how do we know when a game will require us to go get two boxes instead of one! Therefore, although it’s quick to get our toys ready for the game, we only have so much room in our house for boxes, and we’re sometimes wasting space
Here’s another drawback: what if I asked you how many Barbie dolls you have? It would take forever for you to count them. You would have to check each box, even those that you found ultimately didn’t have any Barbies. That’s an inefficient way to get to the answer.
So what if we organized our bins differently, as in a column-based database? What if we had a Barbie doll bin in which we order and label the dolls for what game they’re part of. And a Jeep bin. And a playdough bin. If we wanted to play game #3, we’d to have to check each box to see if it had a toy marked for #3 and then extract that. Setting up and putting away each game would be a pain!
But this way of storage gives us some benefits:
- Storing Barbie dolls together is easier than storing toys of different shapes (this help us with compression, see more info here
- We can now quickly calculate things like count of Barbie dollars, average number of Barbie dolls, or just get all our dolls quickly.
- We will fill the entire Barbie doll bin before we start another bin; there will only be one doll bin with extra space (the last one). This will decrease the total wasted space overall.
So what does this mean? Which way should Rhys store his toys? Well, there is no right way, it depends on what Rhys wants. Does Rhys want to insert combinations and retrieve one game efficiently? Or does he want to do calculations, such as count and average?
Traditional row-based databases include:
- SQL Server
These types of databases are optimized for inserts, especially inserting an entire row at once. They’re also good for retrieving entire rows, like the entire combination of toys Rhys wants to play with. But they are not good for doing analytical work; for that we need a columnar database. Columnar databases include:
The idea here is that when you do analytical work, such as finding the average number of customers per product, you will use two columns but all the rows. When retrieving a user’s profile data, on the other hand, you are accessing all columns and only one row. The latter would be a good use case for a row based database.