Recently, I had a situation where there were a multitude of ways a set of events could occur to a particular data point over time – where I needed to know what every possible pattern of events had occurred to that single data point. In this post, I’ll walk through the scenario of when you’d want to do something like this and how find patterns in a dataset.
Imagine you have a source system which allows a customer to interact with your front-end application, like updating their profile, and you want to know how they update their profile and all the different patterns in which they go about interacting with your system. To do that, you’ll need to recursively join to your data and build that pattern, in a set-based way, achieving superior performance with a very large dataset.
Let’s look at the data:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
CREATE TABLE seed ( id INTEGER ,status VARCHAR(100) ); INSERT INTO seed SELECT 1, 'A'; INSERT INTO seed SELECT 1, 'B'; INSERT INTO seed SELECT 1, 'C'; INSERT INTO seed SELECT 2, 'A'; INSERT INTO seed SELECT 2, 'E'; INSERT INTO seed SELECT 3, 'E'; INSERT INTO seed SELECT 3, 'F'; INSERT INTO seed SELECT 3, 'G'; INSERT INTO seed SELECT 3, 'H'; |
Yup, this example is in MySQL 8…
So, looking at the data set above, we can see that the pattern for ID #1 is “A, B,C” and the pattern for ID #2 is “A,E”. But first, we have to prepare the data and give it some sort of order to work with. Typically, you’d want to order this data by some sort of event-indicator, like a timestamp; For this example, we’ll just use the order of the alphabet. Take a look:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
CREATE TABLE water ( id INTEGER ,status VARCHAR(100) ,row_id INTEGER ); INSERT INTO water SELECT id ,status ,ROW_NUMBER() OVER(PARTITION BY id ORDER BY status) row_id FROM seed ; |
Recursively Find Patterns In A Dataset:
Next, we want to use a recursive CTE (Common Table Expression) to join back to itself to concatenate the previous pattern to the next row of data. So, if we look at ID #1 and it’s 3 rows of data. The first row’s pattern is “A”, the second row is “A,B” and the third and final row is “A,B,C” — which is the pattern we’re looking for at this datapoint. Here’s the query to achieve that:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
WITH RECURSIVE plant AS( SELECT id ,status ,row_id FROM water WHERE row_id = 1 UNION ALL SELECT w.id ,CONCAT(plant.status,',',w.status) as status ,w.row_id FROM water w JOIN plant ON w.id = plant.id AND w.row_id = plant.row_id +1 ) SELECT id ,status as status_pattern FROM( SELECT id ,status ,ROW_NUMBER() OVER(PARTITION BY id ORDER BY row_id DESC) AS filter FROM plant )output WHERE filter = 1 ORDER BY 1 ; |
To get this to work on Postgres, simply change the CONCAT() to ||
The output of the above query will return three records for each of the three IDs in our example, along with its corresponding pattern. Below is the output and if you click here, you can find a Fiddle with this entire example.
0 1 2 3 4 5 6 |
id status_pattern -- -------------- 1 A,B,C 2 A,E 3 E,F,G,H |
Hope someone finds this helpful.