Join Two Type-2 Tables and Rebuild History

I recently was tasked with building a new table with two Type-2 tables as a source and, not only maintain the history but, rebuild the history. Let’s look at the best way to join two type-2 tables together and then, more specifically, how to join the two historicized tables together while preserving the logical history of changes as though they were one.

Joining two type two tables with history maintained.
Join Two Type-2 Tables

The Setup:

Imagine you have two tables, typically type-2 dimensions, which have persisted historical records of changes: One for capturing your customer’s name over time, and another for tracking your customer’s phone number changes over time. In some cases, you may not have a historical record in one table, or the other, leaving an unknown name or phone number captured for a given customer. When re-building history you have to take care of these new periods of time where something hasn’t happened yet.

Now imagine that a customer changed her name three times and only on that third time did she also provide a phone number. That means, our third historical entry of her name change could potentially turn into five records, depending on when the phone number insert/update occurred.

The Caveat:

When a record starts or ends can vary depending on your business case and engineering decisions. Therefore, if your solution has infinite start-dates or lagging end-dates, there may be some tweaks required in the CTE portion of the code.

This post assumes a finite start date and infinite end dates for active records. This post also assumes that an end date is the same as the next record’s start date. It could be that a start date is a year, month, week, day, second, or millisecond ahead of the previous end date or vice versa.

The Data With Scenarios:

DDL for Microsoft’s SQL Server with output below:

Two Type2 tables with sample data.

Two Type-2 Tables with sample data of varying date ranges.

The Premise:

Take a look at the data set above in the two tables, one in green and the other in yellow. Notice that every date is January first with only the year changing; I did this to keep it simple and easy to understand. Also notice that the year 2999 is representative of a date far in the future, to signify infinity.

Infinity or, another way of putting it, in about 980 years someone is going to have to rewrite this code!

What we’re challenged with is when the begin and end dates don’t line up perfectly with the other corresponding rows of historical data in the other table. Most of the dates throughout our history overlap, but not perfectly. Therefore, we’re left with situations where a name changed but not a phone number or a phone number changed but not a name. Each of these scenarios needs to be solved to re-build a perfect historical representation of our customer’s phone and name changes within a single table.

Building The Query:

First, let’s build the  CTE (common table expression / with statement) which will drive the rest of the query to join two type-2 tables:

Output of CTE, the inner logic to join two type-2 tables
Output of CTE, the inner logic to join two type-2 tables

The output of the CTE includes all the pieces of the puzzle to properly rebuild the history of two type-2 tables. Next, we’ll put together the CTE with the statement which unions together all the possible data scenarios:

Output of unioned data which solves for each scenario.
The output of unioned data which solves for each historical scenario.

In the next step, we’ll wrap the above statement to remove erroneous data from the unions, they’re the records in red; Which, are quite easy to spot, they both have the same begin and end dates. So, logically, we’ll filter those out:

Final output of two type-2 tables joined together.
The final output of two type-2 tables joined together.

The final output and expected results are above, in blue. Notice that each scenario is accounted for, which includes end dates that are both current and in infinite. Too, it’s important to call out, the begin date is the same as the previous end date. Depending on your architecture, you may have end dates which are one day less to allow inclusive “between” clauses. To solve for this you’ll need to play around with subtracting a day in the CTE dataset, which is trivial to implement.

*  *  *

This was a very fun and challenging project and I’m writing this post for posterity and maybe it will help someone else out in the future. Hopefully, there are no bugs in this code as I’ve tested this on billions of rows of data but, I suppose, there could be more scenarios I didn’t think of or possibilities not found in my production dataset… however, I don’t think so 😉