How to Say It for Data Engineers

Aug 12, 2019 · 5 min read

I used to think people who made a big deal out of small syntactical or idiomatic differences were just being pedantic and condescending. I still sometimes think that, but I’ve come to appreciate how the language we use to talk about our tools and products impacts our behavior. The following are examples of common usages of imprecise language and an explanation of why that language can be damaging.

What we say	What we should say	What is the difference
Can you approve this pull request?	Can you review this change?	We want our teammates to know that we are open to and looking for their feedback, not just looking for their rubber stamp. We don’t want to imply that their comments would just slow us down or get in the way; rather, we want to encourage them to review at their own speed and ask any questions they have. This language allows for a more thorough review- fewer bugs!
The ETL	The X pipeline	The concept of “the ETL” is a thing of the past; grammatically, it doesn’t even make sense to say “the extract, transform, and load.” When people say that they usually mean “the huge ETL job that we have that runs overnight and updates all our tables.” This promotes the idea that everything needs to be a monodag or everything has to run as part of one job. In reality, we should think about individual pipelines as groups of dependencies that bring some value to stakeholders: the order pipeline, the customer pipeline, the job that sends the data to marketing, etc. Speaking of these as individual gets us away from the knee jerk reaction to add everything “to the ETL” just because it’s there. Making things depend on each other (as in a DAG) that don’t truly depend on each just creates work for you. We also want to separate pipelines because each pipeline can have its own SLA. Try to move away from thinking of the ETL as a giant being and think about the individual components we deliver to customers
ETL tool	Workflow management system	The tools we use now, such as Airflow, can do more than just extract, transform and load. They “[provide] an infrastructure for the set-up, performance and monitoring of a defined sequence of tasks” (Source) The tasks may be pipelines that write to the warehouse, but they may also be pipelines that email team members. Additionally, the performance and alerting that come out of the box with these tools are an integral part of their function. These aren’t just shovels for data; they help us build sophisticated applications for comprehensive data solutions. We should talk about them in a way that highlights all their features.
Put it in Git	Committed it to master/our repo	You don’t put anything in Git. Git doesn’t want your code (well actually, it’s open source you can add code here). Git is a system that helps you interact with changes you’ve made to a shared repository. And Git is not the same as Github or Bitbucket. Github hosts your Git repository and adds features, such as the pull request and comment features. You’re adding the code to the master repository via Git and Github; you don’t add code to Git.
Put the data in Looker/Periscope/Tableau	Put the data in the underlying database so the reporting tool can see it	Reporting tools are layers on top of the database. This is important because you don’t want people to think you have two different data sets; one for your database and one for your reporting tool. They may not realize that the two are the same or that they can get the same data from the other tool if that helps them. They may think they have to request everything to be put into the reporting tool individually after its in the database. If you explain this relationship to your stakeholders, it can help them understand how the tools you manage interact and how they can best use them.
The table has no primary key	The table doesn’t have a primary key defined	All tables have primary keys, even if it’s all of the columns. Without primary keys, it would be the wild west. The primary key is actually part of the table definition, not a reflection of the data inside it; even a table with no records has a primary key. I know that’s kind of weird, but just because you didn’t define the primary key doesn’t mean it doesn’t exist. If it doesn’t exist, you don’t have a table; you just have rows of data.
The database is down	Insert more precise error here	In my career, Snowflake, as a database management system, has gone down twice; more commonly, there is a specific error that you want to communicate. Errors that people may perceive as the database going down are: Snowflake says this table doesn’t exist, Snowflake won’t let login because I have the incorrect account name, or I can’t see the list of tables. Precision in error reporting helps promote faster fixes, but also prevents stakeholders or others from overhearing and developing suspicions about our products
Woman engineer	Female engineer	Actually, I prefer engineer, but I do identify as female so occasionally this comes up. Let’s be clear: nobody says man engineer, nobody says man president, nobody says man doctor. When you say woman engineer, you’re saying two nouns; you’re creating an entity. When you say female engineer, you’re qualifying or describing an existing entity. Female engineers are a subset of engineers, not their own category.