Recently, I had a situation where there were a multitude of ways a set of events could occur to a particular data point over time – where I needed to know what every possible pattern of events had occurred to that single data point. In this post, I’ll walk through the scenario of when you’d want to do something like this and how find patterns in a dataset.
Imagine you have a source system which allows a customer to interact with your front-end application, like updating their profile, and you want to know how they update their profile and all the different patterns in which they go about interacting with your system. To do that, you’ll need to recursively join to your data and build that pattern, in a set-based way, achieving superior performance with a very large dataset.
CTEs are a powerful feature of modern RDBMSs which allow you to do some very creative things with set-based data. Some systems even allow you to nest them inside of themselves for even more crazy, creative solutions. Let’s discuss when to use a CTE.
The word “common” from the acronym CTE (Common Table Expression) means you want to use a query more than once — because it’s common.
Use Google Sheets to automate your data pipeline development:
This isn’t the first time I’ve talked about a code generator or SQL Generation on this blog, but it’s worth discussing again because I wanted to talk about a recent project where I upped the ante on not just generating SQL, but generating the DDL & DML to support an entire ETL pipeline — all while self-documenting everything!
Throughout the project I was able to quickly test different indexing strategies without writing a single line of code.
A common gripe I have when reviewing and maintaining other people’s SQL, aside from the obvious readability due to poor formatting, is their lack of aliases. In this post, I’d like to talk about why you should always reference aliases. The one major reason why, and the one minor reason when not to.
When maintaining and improving upon legacy code it’s easy to redefine standards, or define nonexistent standards, which have cascading effects on the pipeline. Take something as simple as an ill-conceived naming convention which, in hindsight, turned out to not make sense over time. At some point the benefits of rewriting lots of code outweigh simply sticking with the bad convention. In these cases consistency over standards become the convention.
I recently was tasked with building a new table with two Type-2 tables as a source and, not only maintain the history but, rebuild the history. Let’s look at the best way to join two type-2 tables together and then, more specifically, how to join the two historicized tables together while preserving the logical history of changes as though they were one.
Imagine you have two tables, typically type-2 dimensions, which have persisted historical records of changes: One for capturing your customer’s name over time, and another for tracking your customer’s phone number changes over time. In some cases, you may not have a historical record in one table, or the other, leaving an unknown name or phone number captured for a given customer. When re-building history you have to take care of these new periods of time where something hasn’t happened yet.
Now imagine that a customer changed her name three times and only on that third time did she also provide a phone number. That means, our third historical entry of her name change could potentially turn into five records, depending on when the phone number insert/update occurred.
Regardless of your RDBMS you’ll find more than a handful of scripts online to add a Date Dimension or Calendar Table to your schema. One problem I’ve seen is they lack an every-other-week column. So, let’s explore a common SQL Server script and how to add an Every Other Week flag to our date dimension.
This method is written for Microsoft’s SQL Server, but it is ANSI standard and will work with any date dimension which has an integer day-of-week column, which I’ve never seen one that doesn’t…
In ETL, we often have to load many targets from a common set of base tables. Inevitably the targets are different enough that we have to create multiple queries or views to populate the many outputs of data. Which is fine, except now you’ve got yourself a maintenance nightmare, one that is avoidable. I’d like to share a trick with you to take a single query that can be recursively modified to dynamically change its structure to get different outputs. I call it, dynamic querying using block quotes.
The business case:
The sales department would like to take a single report that already exists and split it into two reports. The additional report will require different fitlers, aggregates, columns, and even joins!