When dealing with transactional data often there are many levels of granularity lying within. Finding these granularities exposes how your data is shaped as it accumulates and helps paint a better picture of what I like to call Lifes within the data. In this post, I want to share a technique I use to find data patterns which will be beneficial for everyone from the analyst to the architect.
Why do I refer to these data patterns as Lifes?
I haven’t found anything transactional in nature that doesn’t have some sort of recurring theme, with a distinct beginning and end, that couldn’t tell a story. It is these finite beginning/ends, start/stops, on/offs that paint the picture that is the “life” of the data. The life of these stories often have many sub-narratives and are interwoven within a single holistic life of the data. A great example is the familiar case of a customer purchase history. The customer is the holistic life of the data, their purchase orders, individual line items, and even a particular line item purchased over several purchase orders are all examples of sub-narratives within the story of a single customer.
What’s the method to find patterns in data?
The most common method is the ROW_NUMBER window function, but RANK and DENSE_RANK are valuable tools to explore canonical data as well. The best way to explain the method is to show it! The following query uses Microsoft’s World Wide Importers Data Warehouse as our dataset:
ROW_NUMBER() OVER(PARTITION BY [CUSTOMER KEY] ORDER BY [INVOICE DATE KEY], [SALE KEY] ASC) AS CUSTOMER_LIFE
,ROW_NUMBER() OVER(PARTITION BY [CUSTOMER KEY], [WWI INVOICE ID] ORDER BY [INVOICE DATE KEY], [SALE KEY] ASC) AS INVOICE_LIFE
,ROW_NUMBER() OVER(PARTITION BY [CUSTOMER KEY], [WWI INVOICE ID] ORDER BY [INVOICE DATE KEY], [SALE KEY] DESC) AS INVOICE_LIFE_REVERSED
,DENSE_RANK() OVER(PARTITION BY [CUSTOMER KEY] ORDER BY [WWI INVOICE ID]) AS INVOICE_COUNT
,SUM([PROFIT]) OVER(PARTITION BY [CUSTOMER KEY], [WWI INVOICE ID] ORDER BY [INVOICE DATE KEY], [SALE KEY] ASC ROWS UNBOUNDED PRECEDING) AS INVOICE_RUNNING_PROFIT
,SUM([TOTAL INCLUDING TAX]) OVER(PARTITION BY [CUSTOMER KEY], [WWI INVOICE ID] ORDER BY [INVOICE DATE KEY], [SALE KEY] ASC ROWS UNBOUNDED PRECEDING) AS INVOICE_RUNNING_TOTAL
FROM [Fact].[Sale] S
WHERE [CUSTOMER KEY] > 0
ORDER BY [CUSTOMER KEY], [INVOICE DATE KEY], [SALE KEY] ASC;
Example output of data:
The bold data are the manufactured pieces we extracted from the data to show the different data-lives hidden within. Take a moment to study the first few columns of data…
From the above data we can find three Lifes and a useful fact from the data, let’s break it down:
- Customer Life — The numbering represents every line item of purchases throughout all time for each customer. The numbering starts again at 1 when a new customer life begins.
- Invoice Life — This numbering pattern is a Life of a customer’s invoice. Each customer has many invoices and the numbering starts again at 1 when a new invoice life begins.
- Invoice Life Reversed — This is the same concept as Invoice Life above instead, we’re reversing the numbering to allow us to find the last invoice record by simply selecting where = 1 while finding the first record can be done via the Invoice Life.
- Invoice Count — This is a fact detail that uses the DENSE_RANK window function to find at any given line-item record which invoice number it is. Just another way to group by invoice or find a count by a customer, or any life for that matter.
I’m certain there are more patterns we can extract from the same data set, but the point here is to illustrate a method I use regularly when examining data to find patterns and grains. Hopefully this is a new technique and one that you can implement and add to your arsenal.
Come back and leave a comment with your thoughts!