Use Google Sheets to automate your data pipeline development:
- ETL Generator
- DDL Generator
- DML Generator
- Documentation Generator
- Code Generator
This isn’t the first time I’ve talked about a code generator or SQL Generation on this blog, but it’s worth discussing again because I wanted to talk about a recent project where I upped the ante on not just generating SQL, but generating the DDL & DML to support an entire ETL pipeline — all while self-documenting everything!
Throughout the project I was able to quickly test different indexing strategies without writing a single line of code.
A number of source system tables which are replicated in a truncate-and-reload fashion, daily. For anyone, who’s not stuck as some shitty consulting firm *cough* Hi, James. middlefinger.jpg *cough*, knows that this is no way to properly source data. Let alone disaster-recoverability, data-backfilling, and financial compliance – to name a few. Therefore, I was tasked with:
- Capture all data changes over time.
- Historicize the FULL data-set (~60 billion records processed per load).
- Flag erroneous records for un-expected NULLs or Data Truncation.
- Flexible to change.
The last bullet point is why I went with developing a code generator. There are many steps involved when building a data pipeline. And data pipelines are inherently fragile. So, coming up with a solution that would allow me to quickly adapt to change as the project matured was crucial to my success. Throughout the project I was able to quickly test different indexing strategies without writing a single line of code.
Unfortunately I cannot share any screenshots or code samples, but I can share the theory to enable you to accomplish the same. Even if I did share with you the gSheet with a working example, the code generator is specific to Teradata, warehouse metadata-fields, and uses proprietary functions/procedures. So, you’re on your own on this one 🙂
- Start with documentation.
- Using Google Apps Scripts, develop your solution.
- Write metadata to sheet and generated code to gDrive.
Start With Documentation
For me, this started out as me documenting the source-to-target mappings of how I wanted to define my schemas (database, table, column name and datatype) for staging, production, and error tables. The first thing I defined was the source schema for all tables, then I added columns for data profiling:
- Record Count
- Distinct Count
- Distinct Percent
- Null Count
- Null Percent
- Empty-set Count
- Empty-set Percent
- Minimum Value
- Maximum Value
- Minimum Length
- Maximum Length
- Average Length
Once I had the source-schema and source-metadata it hit me that if I were to add some flags about the data I could EASILY build the DDL for my schema, which then lead to me realizing I could take it a step further and also generate all the DML, too! Below are some ideas for metadata-flags which you can use in your code generator to handle the logic:
- Should this field be loaded to next layer?
- Should I historicize this field? (Track changes over time)
- Unique Primary Index column?
- Non-Unique Primary Index column?
- String-based filter used in where clause.
- Flag record as Error if field is NULL.
- Flag record as Error if Duplicate based on this/these fields.
- Is data case-sensitive?
- Can this field be NULL?
- If NULL, what is the default value?
- Is there a specific formatting for field?
- String of field contents if a derive field.
My idea was to have a set of columns for each schema and a metadata-flagging system, like above, for each layer. At the end of the day, you’d be left with a fully documented pipeline highlighting each layer of data movement and the logic used to move the data along. Each layer could have different column names and datatypes as you wish. All of this, fully documented!
I also recommend including an order so you can also define the order of your tables.
Using Apps Scripts to Develop Your Solution
Here’s the gist:
- Read metadata from sheet and populate a massive array of lists and sublists/subarrays. I recommend doing this in a dynamic way, meaning that if you want to add a column on metadata later, it will be easy to add it to your array without having to play with column indexes. (This is a lot of data! For an array.)
- Loop over your array as needed to build indexes, select statements, where/on clauses, and other code.
- Store the generated code into smaller chunks and bring them together at a later step.
- Build in error checks wherever you can, things like: Am I defining a UNPI and NUPI at the same time? Or, do my data type changes potentially cause data-truncation?
Google’s App Scripts have a byte limit and you’re guaranteed to hit that, so split your generated code up and bring it together at a final step.
Print Metadata And Store Generated Code To gDrive
Throughout your code you’ve, hopefully, captured potential bugs in DDL and DML which can save you time by catching silly issues which you erroneously introduced your documentation step, which is admittedly easy to do. I do this by creating an array of errors and a method to add them to the array which also not only captures the error, but also an error message and location where the error was found. Once complete, I’ll de-dupe the error list and display them to the user to assess if they’re false-positives or not.
Finally, I found it awesomely helpful to save all the code, for any/all environment (development, test, and production), in a unique folder in my gDrive organized in any logical way. This allows me to share the code and export it as a zip file to my local machine for execution.
In total, at the time of writing this, I was able to document and generate code for source to target ETLs and its accompanying DDL:
Source: ~20 tables read
Code Written: ~800 lines
Code Generated: ~30,000 lines
DDL Generated: ~80 objects (including views)
Row Count: ~60 Billion Records Processed
Full Load Time: ~1.5 Hours