The topic of dynamic dimension descriptions isn’t new, but the method for which I’ve come up with is a hybrid of several other methodologies, making for a great subject piece. Using codes and descriptions in your warehouse dimensions is standard practice, even in some cases the use-case may require displaying a code versus a description, for example: displaying USA instead of United States of America, or even in situations where the business understands what a “TPS” report type is more so than the formal “Testing Procedure Specification” description. In any case, you’ll want a method that enables the ability to add and modify additional codes and their descriptions without having to perform risky manual updates to the warehouse.
In this post, I will discuss a rigid yet tolerant way for properly implementing dynamic dimension descriptions without ever directly modifying a warehouse dimensional table. Instead, we’ll implement a lookup table that an end user can insert and update freely, along with a robust ETL process that uses this same lookup table to perform description updates and even type 2 historical tracking, if necessary.
Dynamic Dimension Descriptions Using An ETL Lookup Table
In the above diagram, you’ll notice that the ETL is performing two new, additional tasks that are not part of a typical dimensional load; typically, when dealing with robust source systems, the source will contain the code along with a proper description – at which point, you’d simply insert/update and type1/type2 as the business requires – however, in our all-to-common scenario, we have to come up with a way to handle manually inserting descriptions at a later time.
So, as new codes come through the source system, meaning a code was found that is NOT in the dimension, the new code is inserted into two tables, the dimension, and our new ETL_LOOKUPS table. It is important to note three important things when a new code comes through:
- Only insert into the target table if the code is not already there.
- When a new code is found, a temporary description must be set as to not confuse our end users at the presentation layer.
- Notify stakeholders of the new code.
Let’s break down each of the above three things:
Firstly, depending on the order in which the new code is loaded, sometimes we may load a code into the ETL_LOOKUPS table before a new code comes through the source system, so we don’t want to insert duplicates. Secondly, when a new code is found, you’ll want to choose a description that is unlike other descriptions, like “ETL UNKNOWN”; simply using “UNKNOWN” is not a smart choice as there could be legitimate unknown descriptions. And finally, it is imperative to notify the stakeholders of a new code being inserted into the warehouse; typically this is done via the reporting layer with reports generating for descriptions of “ETL UNKNOWN” or, in other cases, the ETL itself will generate emails — no matter the method, it’s important to notify stakeholders.
Dynamic Dimension Descriptions Using An Excel Source File
The above diagram is an extension of the previous one, exposing the addition of an Excel Source File which allows a business user to modify a single file to add and modify dimension descriptions. This is the key concept to eliminating manual intervention of ETL resources updating tables on the fly. Instead, we give the power of presentation layer descriptions to the business users, allowing for a dynamic way to update our warehouse. There are three important things to remember, though:
- The excel sheet should generate a CSV file of itself to use as an input.
- Breaking down codes & descriptions by a subject-area or table-name is key to keeping this dynamic.
- Inserts ONLY. No updating or deleting shall occur.
Let’s take each of the above three things one at a time, in more detail:
One: The excel sheet itself can be used as the source file, but copy/paste issues will almost certainly occur and cause unreliable results in an already picky ETL ecosystem. I prefer to use excel VBA macros and, with a simple click of a button, auto-generate a file while placing it in the proper directory for the ETL to consume. Still, regardless of your method of choice, I recommend the ETL archiving the source files for auditing purposes; too often mistakes occur and having the paper trail makes life easier!
Two: You should notice that the ETL_LOOKUP table has an extra field to delineate a table’s name or a subject area. This is very important because two different dimensions could have the same codes and vastly different descriptions. Often times we’ll conform several sources into a single dimension, this conforming can mean that in one system the description of an object could have a different code than the same object in another system. To solve this we need a more granular method of storing codes and their descriptions.
Lastly, it’s important to remember that when a human is introduced then we must account for extreme error. Imagine a scenario where a description is updated for a particular dimension several times over the course of a few weeks, also during this time the dimension has itself gone through several type 2 changes. How do we know which update of a description affected which historical version? This helps mitigate that; by using type2 logic, our system will record individual updates to a code’s description which will make rolling-back changes not just easier, but possible!
A High-Level Explanation & Takeaways
The concept here is that we want an ETL process, when loading the dimension with data, where the ETL itself handles looking up a code description and updating itself as it processes. Sure, the first time a code is entered we’ll have bad descriptions. And yes, the descriptions won’t be updated until the next time that dimension’s ETL process runs. But, if we want to put the power of updating core meaningfulness within our warehouse back into the hands of our users, then it’s all worth it.
I find this topic interesting and am always excited to hear how other people solve this problem – please leave your comments below.