I recently was tasked with building a new table with two Type-2 tables as a source and, not only maintain the history but, rebuild the history. Let’s look at the best way to join two type-2 tables together and then, more specifically, how to join the two historicized tables together while preserving the logical history of changes as though they were one.

The Setup:
Imagine you have two tables, typically type-2 dimensions, which have persisted historical records of changes: One for capturing your customer’s name over time, and another for tracking your customer’s phone number changes over time. In some cases, you may not have a historical record in one table, or the other, leaving an unknown name or phone number captured for a given customer. When re-building history you have to take care of these new periods of time where something hasn’t happened yet.
Now imagine that a customer changed her name three times and only on that third time did she also provide a phone number. That means, our third historical entry of her name change could potentially turn into five records, depending on when the phone number insert/update occurred.
The Caveat:
When a record starts or ends can vary depending on your business case and engineering decisions. Therefore, if your solution has infinite start-dates or lagging end-dates, there may be some tweaks required in the CTE portion of the code.
This post assumes a finite start date and infinite end dates for active records. This post also assumes that an end date is the same as the next record’s start date. It could be that a start date is a year, month, week, day, second, or millisecond ahead of the previous end date or vice versa.
The Data With Scenarios:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 |
create table customer_name ( customer_id integer, full_name varchar(30), begin_date date, end_date date ); create table customer_phone ( customer_id integer, phone_number varchar(30), begin_date date, end_date date ); --Scenario 1: Customer 101 -- |-----------|-------|------8 -- |----|-----------------8 insert into customer_name select 101, 'Big Bird', cast('01/01/2003' as date), cast('01/01/2007' as date); insert into customer_name select 101, 'Little Bird', cast('01/01/2007' as date), cast('01/01/2009' as date); insert into customer_name select 101, 'Just Bird', cast('01/01/2009' as date), cast('01/01/2999' as date); insert into customer_phone select 101, '555-123-4567', cast('01/01/2004' as date), cast('01/01/2006' as date); insert into customer_phone select 101, '555-7654-321', cast('01/01/2006' as date), cast('01/01/2999' as date); --Scenario 2: Customer 102 -- |--------------------------8 -- |----|-----|-----------8 insert into customer_name select 102, 'Elmo', cast('01/01/2001' as date), cast('01/01/2999' as date); insert into customer_phone select 102, '555-867-1234', cast('01/01/2002' as date), cast('01/01/2006' as date); insert into customer_phone select 102, '555-867-5309', cast('01/01/2006' as date), cast('01/01/2009' as date); insert into customer_phone select 102, '555-555-5555', cast('01/01/2009' as date), cast('01/01/2999' as date); --Scenario 3: Customer 103 -- |---|----------------------8 -- |----------------------8 insert into customer_name select 103, 'Bert', cast('01/01/2000' as date), cast('01/01/2001' as date); insert into customer_name select 103, 'Ernie', cast('01/01/2001' as date), cast('01/01/2999' as date); insert into customer_phone select 103, '555-666-7777', cast('01/01/2001' as date), cast('01/01/2999' as date); --Scenario 4: Customer 104 -- |----------------------8 -- |--------------------------8 insert into customer_name select 104, 'Von Count', cast('01/01/2002' as date), cast('01/01/2999' as date); insert into customer_phone select 104, '666-555-4444', cast('01/01/2000' as date), cast('01/01/2999' as date); --Scenario 5: Customer 105 -- -- |--------------------------8 insert into customer_phone select 105, '111-222-3333', cast('01/01/2000' as date), cast('01/01/2999' as date); --Scenario 6: Customer 106 -- |----------|-----|--| -- |----|-----------------8 insert into customer_name select 106, 'Grover', cast('01/01/2003' as date), cast('01/01/2007' as date); insert into customer_name select 106, 'Grover Blue', cast('01/01/2007' as date), cast('01/01/2009' as date); insert into customer_name select 106, 'Eric J', cast('01/01/2009' as date), cast('01/01/2010' as date); insert into customer_phone select 106, '222-333-4444', cast('01/01/2004' as date), cast('01/01/2006' as date); insert into customer_phone select 106, '333-444-5555', cast('01/01/2006' as date), cast('01/01/2999' as date); --Scenario 7: Customer 107 -- |-----------|--------------| -- |----|-------------| insert into customer_name select 107, 'Kermit', cast('01/01/2003' as date), cast('01/01/2007' as date); insert into customer_name select 107, 'Mr.Piggy', cast('01/01/2007' as date), cast('01/01/2012' as date); insert into customer_phone select 107, '444-555-6666', cast('01/01/2004' as date), cast('01/01/2006' as date); insert into customer_phone select 107, '333-444-5555', cast('01/01/2006' as date), cast('01/01/2008' as date); --Scenario 8: Customer 108 -- |-----------|--------------| -- |----|-------------| insert into customer_name select 108, 'Wooly Mammoth', cast('01/01/2003' as date), cast('01/01/2007' as date); insert into customer_name select 108, 'Snuffleupagus', cast('01/01/2007' as date), cast('01/01/2010' as date); insert into customer_phone select 108, '555-666-7777', cast('01/01/2004' as date), cast('01/01/2006' as date); insert into customer_phone select 108, '666-777-8888', cast('01/01/2006' as date), cast('01/01/2008' as date); --Scenario 8: Customer 109 -- |-----|-----|--------------| -- |-----------------| insert into customer_name select 109, 'Guy Smiley', cast('01/01/2003' as date), cast('01/01/2005' as date); insert into customer_name select 109, 'Guy', cast('01/01/2005' as date), cast('01/01/2007' as date); insert into customer_name select 109, 'Smiley', cast('01/01/2007' as date), cast('01/01/2011' as date); insert into customer_phone select 109, '777-888-9999', cast('01/01/2006' as date), cast('01/01/2011' as date); |
DDL for Microsoft’s SQL Server with output below:
Two Type-2 Tables with sample data of varying date ranges.
The Premise:
Take a look at the data set above in the two tables, one in green and the other in yellow. Notice that every date is January first with only the year changing; I did this to keep it simple and easy to understand. Also notice that the year 2999 is representative of a date far in the future, to signify infinity.
Infinity or, another way of putting it, in about 980 years someone is going to have to rewrite this code!
What we’re challenged with is when the begin and end dates don’t line up perfectly with the other corresponding rows of historical data in the other table. Most of the dates throughout our history overlap, but not perfectly. Therefore, we’re left with situations where a name changed but not a phone number or a phone number changed but not a name. Each of these scenarios needs to be solved to re-build a perfect historical representation of our customer’s phone and name changes within a single table.
Building The Query:
First, let’s build the CTE (common table expression / with statement) which will drive the rest of the query to join two type-2 tables:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
select coalesce(n.customer_id,p.customer_id) as customer_id ,n.full_name ,p.phone_number ,n.begin_date as begin_date_A ,p.begin_date as begin_date_B ,n.end_date as end_date_A ,p.end_date as end_date_B ,row_number() over (partition by n.customer_id, p.customer_id order by n.begin_date) as begin_date_A_rn ,row_number() over (partition by n.customer_id, p.customer_id order by p.begin_date) as begin_date_B_rn ,row_number() over (partition by n.customer_id, p.customer_id order by n.end_date desc) as end_date_A_rn ,row_number() over (partition by n.customer_id, p.customer_id order by p.end_date desc) as end_date_B_rn ,case when coalesce(n.begin_date, p.begin_date) > coalesce(p.begin_date,n.begin_date) then coalesce(n.begin_date, p.begin_date) else coalesce(p.begin_date,n.begin_date) end as greatest_begin_date ,case when coalesce(n.end_date, p.end_date) < coalesce(p.end_date,n.end_date) then coalesce(n.end_date, p.end_date) else coalesce(p.end_date,n.end_date) end as lowest_end_date ,min(case when n.begin_date > p.begin_date then n.begin_date else p.begin_date end) over(partition by n.customer_id) as first_begin_date ,max(case when n.end_date < p.end_date then n.end_date else p.end_date end) over(partition by p.customer_id) as last_end_date from customer_name n full outer join customer_phone p on n.customer_id = p.customer_id and (n.begin_date between p.begin_date and p.end_date or n.end_date between p.begin_date and p.end_date or p.begin_date between n.begin_date and n.end_date or p.end_date between n.begin_date and n.end_date ) |
The output of the CTE includes all the pieces of the puzzle to properly rebuild the history of two type-2 tables. Next, we’ll put together the CTE with the statement which unions together all the possible data scenarios:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 |
with mycte as( select coalesce(n.customer_id,p.customer_id) as customer_id ,n.full_name ,p.phone_number ,n.begin_date as begin_date_A ,p.begin_date as begin_date_B ,n.end_date as end_date_A ,p.end_date as end_date_B ,row_number() over (partition by n.customer_id, p.customer_id order by n.begin_date) as begin_date_A_rn ,row_number() over (partition by n.customer_id, p.customer_id order by p.begin_date) as begin_date_B_rn ,row_number() over (partition by n.customer_id, p.customer_id order by n.end_date desc) as end_date_A_rn ,row_number() over (partition by n.customer_id, p.customer_id order by p.end_date desc) as end_date_B_rn ,case when coalesce(n.begin_date, p.begin_date) > coalesce(p.begin_date,n.begin_date) then coalesce(n.begin_date, p.begin_date) else coalesce(p.begin_date,n.begin_date) end as greatest_begin_date ,case when coalesce(n.end_date, p.end_date) < coalesce(p.end_date,n.end_date) then coalesce(n.end_date, p.end_date) else coalesce(p.end_date,n.end_date) end as lowest_end_date ,min(case when n.begin_date > p.begin_date then n.begin_date else p.begin_date end) over(partition by n.customer_id) as first_begin_date ,max(case when n.end_date < p.end_date then n.end_date else p.end_date end) over(partition by p.customer_id) as last_end_date from customer_name n full outer join customer_phone p on n.customer_id = p.customer_id and (n.begin_date between p.begin_date and p.end_date or n.end_date between p.begin_date and p.end_date or p.begin_date between n.begin_date and n.end_date or p.end_date between n.begin_date and n.end_date ) ) select customer_id ,full_name ,phone_number ,greatest_begin_date AS begin_date ,lowest_end_date AS end_date ,1 AS union_id from mycte union all select customer_id ,full_name ,NULL AS phone_number ,begin_date_A AS from_date ,first_begin_date AS to_date ,2 AS union_id from mycte where first_begin_date >= begin_date_A and first_begin_date <= end_date_A and begin_date_A_rn = 1 union all select customer_id ,NULL AS full_name ,phone_number ,begin_date_B AS from_date ,first_begin_date AS to_date ,3 AS union_id from mycte where first_begin_date >= begin_date_B and first_begin_date <= end_date_B and begin_date_B_rn = 1 union all select customer_id ,NULL AS full_name ,phone_number ,last_end_date AS from_date ,end_date_B AS to_date ,4 AS union_id from mycte where last_end_date >= begin_date_B and last_end_date <= end_date_B and end_date_B_rn = 1 union all select customer_id ,full_name ,NULL AS phone_number ,last_end_date AS from_date ,end_date_A AS to_date ,5 AS union_id from mycte where last_end_date >= begin_date_A and last_end_date <= end_date_A and end_date_A_rn = 1 |
In the next step, we’ll wrap the above statement to remove erroneous data from the unions, they’re the records in red; Which, are quite easy to spot, they both have the same begin and end dates. So, logically, we’ll filter those out:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 |
with mycte as( select coalesce(n.customer_id,p.customer_id) as customer_id ,n.full_name ,p.phone_number ,n.begin_date as begin_date_A ,p.begin_date as begin_date_B ,n.end_date as end_date_A ,p.end_date as end_date_B ,row_number() over (partition by n.customer_id, p.customer_id order by n.begin_date) as begin_date_A_rn ,row_number() over (partition by n.customer_id, p.customer_id order by p.begin_date) as begin_date_B_rn ,row_number() over (partition by n.customer_id, p.customer_id order by n.end_date desc) as end_date_A_rn ,row_number() over (partition by n.customer_id, p.customer_id order by p.end_date desc) as end_date_B_rn ,case when coalesce(n.begin_date, p.begin_date) > coalesce(p.begin_date,n.begin_date) then coalesce(n.begin_date, p.begin_date) else coalesce(p.begin_date,n.begin_date) end as greatest_begin_date ,case when coalesce(n.end_date, p.end_date) < coalesce(p.end_date,n.end_date) then coalesce(n.end_date, p.end_date) else coalesce(p.end_date,n.end_date) end as lowest_end_date ,min(case when n.begin_date > p.begin_date then n.begin_date else p.begin_date end) over(partition by n.customer_id) as first_begin_date ,max(case when n.end_date < p.end_date then n.end_date else p.end_date end) over(partition by p.customer_id) as last_end_date from customer_name n full outer join customer_phone p on n.customer_id = p.customer_id and (n.begin_date between p.begin_date and p.end_date or n.end_date between p.begin_date and p.end_date or p.begin_date between n.begin_date and n.end_date or p.end_date between n.begin_date and n.end_date ) ) select customer_id ,full_name ,phone_number ,begin_date ,end_date from( select customer_id ,full_name ,phone_number ,greatest_begin_date AS begin_date ,lowest_end_date AS end_date ,1 AS union_id from mycte union all select customer_id ,full_name ,NULL AS phone_number ,begin_date_A AS from_date ,first_begin_date AS to_date ,2 AS union_id from mycte where first_begin_date >= begin_date_A and first_begin_date <= end_date_A and begin_date_A_rn = 1 union all select customer_id ,NULL AS full_name ,phone_number ,begin_date_B AS from_date ,first_begin_date AS to_date ,3 AS union_id from mycte where first_begin_date >= begin_date_B and first_begin_date <= end_date_B and begin_date_B_rn = 1 union all select customer_id ,NULL AS full_name ,phone_number ,last_end_date AS from_date ,end_date_B AS to_date ,4 AS union_id from mycte where last_end_date >= begin_date_B and last_end_date <= end_date_B and end_date_B_rn = 1 union all select customer_id ,full_name ,NULL AS phone_number ,last_end_date AS from_date ,end_date_A AS to_date ,5 AS union_id from mycte where last_end_date >= begin_date_A and last_end_date <= end_date_A and end_date_A_rn = 1 )sub where begin_date <> end_date order by customer_id, begin_date |
The final output and expected results are above, in blue. Notice that each scenario is accounted for, which includes end dates that are both current and in infinite. Too, it’s important to call out, the begin date is the same as the previous end date. Depending on your architecture, you may have end dates which are one day less to allow inclusive “between” clauses. To solve for this you’ll need to play around with subtracting a day in the CTE dataset, which is trivial to implement.
* * *
This was a very fun and challenging project and I’m writing this post for posterity and maybe it will help someone else out in the future. Hopefully, there are no bugs in this code as I’ve tested this on billions of rows of data but, I suppose, there could be more scenarios I didn’t think of or possibilities not found in my production dataset… however, I don’t think so 😉