Recently, I had a situation where a set of events could happen to a particular data point over time in a multitude of ways, and I needed to know every pattern of events that had occurred to that single data point. In this post, I’ll walk through a scenario where you’d want to do something like this and how to find patterns in a dataset.
Imagine you have a source system that lets a customer interact with your front-end application, for example by updating their profile, and you want to know how they update their profile and all the different patterns in which they interact with your system. To do that, you’ll need to recursively join to your data and build that pattern in a set-based way, which should hold up well even with a very large dataset.
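To make the idea concrete, here is a minimal sketch of a recursive join that builds one pattern string per customer and then counts how often each pattern occurs. It runs on SQLite through Python’s built-in sqlite3 module. The `profile_event` table, its columns, and the sample rows are all made up for illustration, and it assumes each customer’s events are already numbered 1, 2, 3… in `event_seq`.

```python
import sqlite3

# Hypothetical schema and data: one row per profile-update event,
# numbered in the order it happened for each customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE profile_event (customer_id INTEGER, event_seq INTEGER, event_name TEXT);
INSERT INTO profile_event VALUES
  (1, 1, 'email'), (1, 2, 'phone'), (1, 3, 'address'),
  (2, 1, 'email'), (2, 2, 'phone'), (2, 3, 'address'),
  (3, 1, 'phone');
""")

PATTERN_SQL = """
WITH RECURSIVE walk (customer_id, event_seq, pattern) AS (
    -- Anchor: start every customer at their first event.
    SELECT customer_id, event_seq, event_name
    FROM profile_event
    WHERE event_seq = 1
    UNION ALL
    -- Recursive step: join back to the data and append the next event.
    SELECT e.customer_id, e.event_seq, w.pattern || ' > ' || e.event_name
    FROM walk AS w
    JOIN profile_event AS e
      ON e.customer_id = w.customer_id
     AND e.event_seq = w.event_seq + 1
),
full_pattern AS (
    -- Keep only the row that reached each customer's last event.
    SELECT w.customer_id, w.pattern
    FROM walk AS w
    WHERE w.event_seq = (SELECT MAX(p.event_seq)
                         FROM profile_event AS p
                         WHERE p.customer_id = w.customer_id)
)
SELECT pattern, COUNT(*) AS customers
FROM full_pattern
GROUP BY pattern
ORDER BY customers DESC, pattern;
"""

patterns = conn.execute(PATTERN_SQL).fetchall()
for pattern, customers in patterns:
    print(pattern, customers)
```

The recursive part is the “join to your data” step: each pass picks up the next event for every customer at once, so the work stays set-based rather than row by row.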
CTEs are a powerful feature of modern RDBMSs which allow you to do some very creative things with set-based data. Some systems even allow you to nest them inside of themselves for even more crazy, creative solutions. Let’s discuss when to use a CTE.
The word “common” in the acronym CTE (Common Table Expression) suggests a query you want to use more than once, because it’s common.
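As a small illustration of using one query more than once, here is a sketch where a single CTE is referenced twice in the same statement: once for the rows and once to compute the average it is compared against. The `sale` table and its data are hypothetical, and it runs on SQLite through Python’s sqlite3 module.

```python
import sqlite3

# Hypothetical sales data, just to have something to aggregate.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (region TEXT, amount INTEGER);
INSERT INTO sale VALUES ('east', 100), ('east', 50), ('west', 30), ('north', 60);
""")

ABOVE_AVERAGE_SQL = """
WITH region_total AS (
    SELECT region, SUM(amount) AS total
    FROM sale
    GROUP BY region
)
SELECT r.region, r.total
FROM region_total AS r                              -- first use of the CTE
WHERE r.total > (SELECT AVG(total) FROM region_total)  -- second use
ORDER BY r.region;
"""

above_average = conn.execute(ABOVE_AVERAGE_SQL).fetchall()
print(above_average)
```

Without the CTE, the grouping query would have to be written out twice, once in the FROM clause and once in the subquery.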
Use Google Sheets to automate your data pipeline development:
This isn’t the first time I’ve talked about a code generator or SQL generation on this blog, but it’s worth discussing again because of a recent project where I upped the ante: not just generating SQL, but generating the DDL and DML to support an entire ETL pipeline, all while self-documenting everything!
Throughout the project I was able to quickly test different indexing strategies without writing a single line of code.
I was recently tasked with building a new table from two Type-2 source tables and not only maintaining the history but rebuilding it. Let’s look at the best way to join two Type-2 tables together and, more specifically, how to join the two historicized tables while preserving the logical history of changes as though they were one.
Imagine you have two tables, typically type-2 dimensions, which have persisted historical records of changes: One for capturing your customer’s name over time, and another for tracking your customer’s phone number changes over time. In some cases, you may not have a historical record in one table, or the other, leaving an unknown name or phone number captured for a given customer. When re-building history you have to take care of these new periods of time where something hasn’t happened yet.
Now imagine that a customer changed her name three times and only on that third time did she also provide a phone number. That means our third historical entry of her name change could potentially turn into five records, depending on when the phone number insert/update occurred.
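Here is one way the rebuild could be sketched, not necessarily the approach the full post takes: collect every date on which either table changes, turn consecutive dates into time slices, then look up whichever name and phone were valid in each slice. The table names, columns, sample rows, and the `9999-12-31` open-ended date are all assumptions for illustration. It runs on SQLite (3.25 or later, for `LEAD`) through Python’s sqlite3 module.

```python
import sqlite3

# Hypothetical Type-2 tables; valid_to is exclusive and '9999-12-31' means "current".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_name  (customer_id INTEGER, name  TEXT, valid_from TEXT, valid_to TEXT);
CREATE TABLE customer_phone (customer_id INTEGER, phone TEXT, valid_from TEXT, valid_to TEXT);
INSERT INTO customer_name VALUES
  (1, 'Ann Lee',  '2020-01-01', '2020-06-01'),
  (1, 'Ann Kim',  '2020-06-01', '2021-01-01'),
  (1, 'Ann Park', '2021-01-01', '9999-12-31');
INSERT INTO customer_phone VALUES
  (1, '555-0100', '2021-03-01', '9999-12-31');
""")

MERGED_HISTORY_SQL = """
WITH boundary AS (
    -- Every date on which anything changes, from either table (UNION de-duplicates).
    SELECT customer_id, valid_from AS ts FROM customer_name
    UNION SELECT customer_id, valid_to   FROM customer_name
    UNION SELECT customer_id, valid_from FROM customer_phone
    UNION SELECT customer_id, valid_to   FROM customer_phone
),
slice AS (
    -- Consecutive boundaries become the new, finer-grained validity periods.
    SELECT customer_id,
           ts AS valid_from,
           LEAD(ts) OVER (PARTITION BY customer_id ORDER BY ts) AS valid_to
    FROM boundary
)
SELECT s.customer_id, s.valid_from, s.valid_to, n.name, p.phone
FROM slice AS s
LEFT JOIN customer_name AS n
       ON n.customer_id = s.customer_id
      AND n.valid_from <= s.valid_from AND s.valid_from < n.valid_to
LEFT JOIN customer_phone AS p
       ON p.customer_id = s.customer_id
      AND p.valid_from <= s.valid_from AND s.valid_from < p.valid_to
WHERE s.valid_to IS NOT NULL      -- the last boundary has no following period
ORDER BY s.customer_id, s.valid_from;
"""

history = conn.execute(MERGED_HISTORY_SQL).fetchall()
for row in history:
    print(row)
```

The LEFT JOINs are what produce the “nothing has happened yet” periods: a slice with no matching phone row simply comes back with a NULL phone. With this made-up data, the third name entry splits in two at the date the phone number arrives.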
One of my favorite interview questions is to hand the candidate a marker and ask them to write out how to find all duplicates in a table. This should be straightforward and weeds out anyone who struggles with SQL; and even if they don’t struggle with SQL, this will be a good way to gauge where they’re at. But, it doesn’t stop there!
I’ve evolved a little since I posted this and would like to hone the focus of this post to a clearer target: Look For Ridiculous instead.
The go-to correct response typically involves grouping by the business key and keeping only groups with a count greater than one, which I’ll show in the first example below. After they successfully give that answer, I throw them a curveball. I’ll say “Great! Show me another way.” Then I ask for another, and another, and another… With great power comes great responsibility, and being the interviewer is nefariously fun!
So, let’s take a look at some ways to find all duplicates in a table by exploring all the ones I’ve come up with. Here’s the test data we’ll be working with, which has 2 sets of duplicates:
Create statement with data for examples used in this post.
The goal is to learn how to find all the duplicates in a table and return only 2 rows of data, i.e., one row for each value that appears more than once, even though there are 5 duplicate rows in total (three copies of one row and two of another).
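For a quick feel of that goal, here is a sketch of two ways to get there, run on SQLite through Python’s sqlite3 module. The `customer` table and its rows are stand-ins I made up, not the post’s actual test data; they just follow the same shape of three copies of one row, two of another, and one unique row.

```python
import sqlite3

# Stand-in test data: 3 copies of one row, 2 of another, 1 unique row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (first_name TEXT, last_name TEXT);
INSERT INTO customer VALUES
  ('Ada', 'Lovelace'), ('Ada', 'Lovelace'), ('Ada', 'Lovelace'),
  ('Alan', 'Turing'), ('Alan', 'Turing'),
  ('Grace', 'Hopper');
""")

# Way 1: group by the business key and keep groups with more than one row.
GROUP_BY_SQL = """
SELECT first_name, last_name, COUNT(*) AS copies
FROM customer
GROUP BY first_name, last_name
HAVING COUNT(*) > 1
ORDER BY first_name, last_name;
"""
dupes_group_by = conn.execute(GROUP_BY_SQL).fetchall()

# Way 2: number the rows within each key; any key that has a row #2 is a duplicate.
# (Window functions need SQLite 3.25 or later.)
ROW_NUMBER_SQL = """
SELECT first_name, last_name
FROM (
    SELECT first_name, last_name,
           ROW_NUMBER() OVER (PARTITION BY first_name, last_name
                              ORDER BY first_name) AS rn
    FROM customer
) AS numbered
WHERE rn = 2
ORDER BY first_name, last_name;
"""
dupes_row_number = conn.execute(ROW_NUMBER_SQL).fetchall()

print(dupes_group_by)
print(dupes_row_number)
```

Both queries return two rows, one per duplicated name, and the unique row drops out.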