I was recently tasked with building a new table from two Type-2 source tables, and with not only maintaining their history but rebuilding it. Let’s look at the best way to join two Type-2 tables and, more specifically, how to join two historicized tables while preserving the logical history of changes as though they were one.
Imagine you have two tables, typically Type-2 dimensions, which have persisted historical records of changes: one capturing your customer’s name over time, and another tracking your customer’s phone number changes over time. In some cases, you may not have a historical record in one table or the other, leaving an unknown name or phone number for a given customer. When rebuilding history, you have to take care of these new periods of time where something hasn’t happened yet.
Now imagine that a customer changed her name three times and only on that third time did she also provide a phone number. That means our third historical entry of her name change could turn into five records, depending on when the phone number insert/update occurred.
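One way to picture the rebuild is sketched below. It is not necessarily the method used in the full post. It assumes hypothetical tables `customer_name` and `customer_phone`, half-open validity ranges (`valid_from` inclusive, `valid_to` exclusive), contiguous history within each table, and `9999-12-31` as the open-ended date. It runs the SQL through Python's built-in `sqlite3` module, which needs SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_name  (customer_id INTEGER, name  TEXT, valid_from TEXT, valid_to TEXT);
CREATE TABLE customer_phone (customer_id INTEGER, phone TEXT, valid_from TEXT, valid_to TEXT);

-- Hypothetical sample: three name versions, one phone record added later.
INSERT INTO customer_name VALUES
    (1, 'Ann Smith', '2020-01-01', '2021-01-01'),
    (1, 'Ann Jones', '2021-01-01', '2022-01-01'),
    (1, 'Ann Brown', '2022-01-01', '9999-12-31');
INSERT INTO customer_phone VALUES
    (1, '555-0100', '2022-06-01', '9999-12-31');
""")

REBUILD_HISTORY_SQL = """
WITH boundaries AS (
    -- Every date on which either attribute changed starts a new period.
    SELECT customer_id, valid_from AS start_date FROM customer_name
    UNION
    SELECT customer_id, valid_from FROM customer_phone
),
periods AS (
    -- Each period ends where the next one begins.
    SELECT customer_id,
           start_date,
           LEAD(start_date, 1, '9999-12-31')
               OVER (PARTITION BY customer_id ORDER BY start_date) AS end_date
    FROM boundaries
)
SELECT p.customer_id, n.name, ph.phone, p.start_date, p.end_date
FROM periods AS p
LEFT JOIN customer_name AS n
       ON n.customer_id = p.customer_id
      AND n.valid_from <= p.start_date AND p.start_date < n.valid_to
LEFT JOIN customer_phone AS ph
       ON ph.customer_id = p.customer_id
      AND ph.valid_from <= p.start_date AND p.start_date < ph.valid_to
ORDER BY p.customer_id, p.start_date
"""

history = conn.execute(REBUILD_HISTORY_SQL).fetchall()
for row in history:
    print(row)
```

With this sample data the third name record splits in two: one period where the phone is still unknown (`NULL`) and one where it is populated. The left joins are what keep the periods where one side has no record yet.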
Regardless of your RDBMS, you’ll find more than a handful of scripts online to add a Date Dimension or Calendar Table to your schema. One problem I’ve seen is that they lack an every-other-week column. So, let’s explore a common SQL Server script and how to add an Every Other Week flag to our date dimension.
This method is written for Microsoft’s SQL Server, but it is ANSI standard and will work with any date dimension that has an integer day-of-week column; I’ve never seen one that doesn’t…
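As a rough illustration of the idea, and not the script from the full post, the sketch below derives an every-other-week flag from nothing but an integer day-of-week column. The table `date_dim`, its columns, and the convention that `day_of_week = 1` means Sunday are all assumptions. A running count of week starts gives each week a sequence number, and that number modulo 2 is the flag. The SQL runs through Python's `sqlite3` module (SQLite 3.25+ for window functions).

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE date_dim (full_date TEXT PRIMARY KEY, day_of_week INTEGER)")

# Four weeks of dates starting on Sunday 2023-01-01.
# isoweekday() is Monday=1..Sunday=7; remap to the assumed Sunday=1..Saturday=7.
start = date(2023, 1, 1)
days = [start + timedelta(days=i) for i in range(28)]
conn.executemany(
    "INSERT INTO date_dim (full_date, day_of_week) VALUES (?, ?)",
    [(d.isoformat(), d.isoweekday() % 7 + 1) for d in days],
)

EVERY_OTHER_WEEK_SQL = """
SELECT full_date,
       -- Running count of week starts = week sequence; odd weeks get flag 1.
       (SUM(CASE WHEN day_of_week = 1 THEN 1 ELSE 0 END)
            OVER (ORDER BY full_date
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) % 2
           AS every_other_week
FROM date_dim
"""

flags = dict(conn.execute(EVERY_OTHER_WEEK_SQL).fetchall())
for full_date in ("2023-01-01", "2023-01-08", "2023-01-15", "2023-01-22"):
    print(full_date, flags[full_date])
```

In this sample the first and third weeks are flagged 1 and the second and fourth 0. Which week counts as the "on" week depends on where the calendar starts, so a real table may need the parity flipped.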
When dealing with transactional data, there are often many levels of granularity lying within. Finding these granularities exposes how your data is shaped as it accumulates and helps paint a better picture of what I like to call Lifes within the data. In this post, I want to share a technique I use to find data patterns, one that will benefit everyone from the analyst to the architect.
Why do I refer to these data patterns as Lifes?
I haven’t found anything transactional in nature that doesn’t have some sort of recurring theme, with a distinct beginning and end, that tells a story. It is these finite beginnings/ends, starts/stops, and ons/offs that paint the picture that is the “life” of the data. The life of these stories often has many sub-narratives that are interwoven within a single holistic life of the data. A great example is the familiar case of a customer purchase history. The customer is the holistic life of the data; their purchase orders, individual line items, and even a particular line item purchased over several purchase orders are all examples of sub-narratives within the story of a single customer.
One of my favorite interview questions is to hand the candidate a marker and ask them to write out how to find all duplicates in a table. This should be straightforward and weeds out anyone who struggles with SQL; and even if they don’t struggle with SQL, this will be a good way to gauge where they’re at. But, it doesn’t stop there!
I’ve evolved a little since I posted this and would like to hone the focus of this post to a clearer target: Look For Ridiculous instead.
After they successfully give an answer, typically one that groups by the business key and keeps counts greater than one (the go-to correct response, which I’ll show in the first example below), I throw them a curveball. I’ll say, “Great! Show me another way.” Then I ask for another, and another, and another… With great power comes great responsibility, and being the interviewer is nefariously fun!
So, let’s take a look at how to find all duplicates in a table by exploring all the ways that I’ve come up with. Here’s the test data we’ll be working with, which has 2 sets of duplicates:
Create statement with data for examples used in this post.
The goal is to learn how to find all the duplicates in a table and return only 2 rows of data, i.e., one for each row that has more than one identical copy, even though there are 5 duplicate rows (three dupes of one row and two of another).
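The go-to answer, grouping by the business key and keeping groups with a count greater than one, might look like the sketch below. The table `customers`, its columns, and the rows are hypothetical stand-ins for the post's test data, shaped the same way: three copies of one row, two of another, and one unique row. The SQL is run through Python's `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT)")

# Hypothetical data: 'Ann' appears three times, 'Bob' twice, 'Cy' once.
conn.executemany(
    "INSERT INTO customers (name, email) VALUES (?, ?)",
    [
        ("Ann", "ann@example.com"),
        ("Bob", "bob@example.com"),
        ("Ann", "ann@example.com"),
        ("Cy", "cy@example.com"),
        ("Bob", "bob@example.com"),
        ("Ann", "ann@example.com"),
    ],
)

FIND_DUPLICATES_SQL = """
SELECT name, email, COUNT(*) AS copies
FROM customers
GROUP BY name, email          -- the business key
HAVING COUNT(*) > 1           -- keep only keys that repeat
ORDER BY name
"""

duplicates = conn.execute(FIND_DUPLICATES_SQL).fetchall()
for row in duplicates:
    print(row)
```

This returns one row per duplicated key, so the five duplicate rows collapse to two results, with the `copies` column showing how many times each appears.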
Using the Row Number window function to flag whether a row is the first occurrence in a series is a performant way to extract more value from your data. By using a Partition By clause, we can group chunks of data together while ordering them to figure out which row is the first in that series. Row Number, with the addition of the Over clause, allows us to achieve this without a subquery by simply wrapping the function in a case statement.
An example of how to find the first occurrence in a series
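A minimal sketch of that pattern follows, assuming a hypothetical `orders` table in which each customer's orders form a series ordered by `order_date`. `ROW_NUMBER()` is wrapped in a `CASE` expression so the first row of each partition is flagged 1 and every other row 0. The SQL runs through Python's `sqlite3` module (SQLite 3.25+ for window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT)")

# Hypothetical data: three orders for customer 1, two for customer 2.
conn.executemany(
    "INSERT INTO orders (customer_id, order_date) VALUES (?, ?)",
    [
        (1, "2023-01-05"),
        (1, "2023-02-10"),
        (2, "2023-01-20"),
        (1, "2023-03-01"),
        (2, "2023-04-02"),
    ],
)

FIRST_OCCURRENCE_SQL = """
SELECT customer_id,
       order_date,
       CASE
           WHEN ROW_NUMBER() OVER (PARTITION BY customer_id
                                   ORDER BY order_date) = 1
           THEN 1 ELSE 0
       END AS is_first_order
FROM orders
ORDER BY customer_id, order_date
"""

flagged = conn.execute(FIRST_OCCURRENCE_SQL).fetchall()
for row in flagged:
    print(row)
```

Only the earliest order per customer carries the flag, and no subquery or self-join is needed to find it.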