Referential Integrity in the Data Warehouse is a controversial topic amongst BI Professionals; you’re either all for it or all against it. While keeping the focus on RI, solely with regard to the Data Warehouse, I want to discuss some pros and cons to help you make a better, more informed decision that will have long-lasting impacts to your warehouse, development lifecycle, and production maintenance.
You need to understand that referential integrity is a decision, not a standard. Blindly implementing RI because you’re “supposed to” is a presumptuous, neglectful response to a complex problem. I think that everyone will agree that leaving constraints at the application layer is an apprentice mistake, nevertheless, when it comes to data it may be our best choice. Yet, still, having table-driven constraints is always preferred — if it makes sense!
When does it make sense to have Referential Integrity?
If you can constrain, you should constrain. Constrain, constrain, constrain! Just one sloppy mistake in the warehouse can be saved by using RI’s Foreign Key constraints. By trying to delete data in one table, or even erroneously dropping the table altogether, RI will use the Foreign Key relationship and constrain these types of changes from happening. It is a job saver! But the deletion of data isn’t the only concern, often performance is a factor.
Take SQL Server, for instance, it doesn’t just enforce RI, it actually feeds this knowledge into the database optimizer. Meaning, certain queries can be more efficient when RI is in place. So, while you will take a performance degradation on “writes”, you’ll receive a performance gain on “reads”; again, with data warehousing as our focus, having slow writes and fast reads is an OK trade-off.
What are the reasons not to enforce Referential Integrity?
This time, let’s take Teradata as our example. In Teradata, if you define a constraint by using REFERENCES, then the keys will be validated for every record that is inserted, updated, or deleted! By implementing the less impeding REFERENCES WITH CHECK OPTION then it will only validate on “commit”, this is a more performant option but still a heavy load when dealing with millions of rows of data. Further, there is a third option, the constraint REFERENCES WITH NO CHECK OPTION implements an RI methodology into the database allowing the optimizer to assume that the constraints are being met but there are no validity checks that occur; you can still freely update and delete your referenced data 🙁 (but it does prevent dropping!). With the ‘no check option,’ if your data actually violates the constraint, you won’t get an error but the optimizer may not return correct results.
When dealing with a Data Warehouse, having fast reads is the name of the game. The decision you make should always take performance as its primary driver. But there is more, by implementing RI you are also handicapping your development team. You’re essentially putting handcuffs on your development and support people.
Naturally, the people for RI are the DBA’s and Architects of the bunch and the people against RI are the developers. The enforcers get to slap the handcuffs on the developers and they pay the price every time they have to maintain or modify the warehouse. Take this as an example: Often times while developing I will add columns, remove columns, reorder my fields to make more sense – and by implementing a constraint on a table or fields like this, it makes it an impossible task to quickly make these kinds of changes. During new development, to add a key column to the top of the table, to maintain column order for cardinality, the only way to do this is to drop and recreate the table; RI prevents this and now requires the developer to drop the referencing table first, and the table that references it, and the table that references it, and the… you get the idea. What a nightmare!
Questions to ask the team before implementing Referential Integrity?
Q: Will implementing RI improve or impede my performance?
A: Research your RDBMS and comb through all the options that are available, in this post we’ve covered two major systems but there are many others. Once you’ve found the potential options, test them. Proof your results and share with your team to help make the best decision.
Q: Does our application layer provide adequate data integrity checks?
A: With regards to the Data Warehouse, this is typically the ETL that maintains your integrity when no RI is in place. Having a solid framework, one that looks for duplicates during load, has restart-ability, and can help flag potential errors, is a good step in the right direction.
Q: What will the downstream impact be to our development and support lifecycles?
A: Taking consideration of your development and support teams is a huge resource saver. Weighing the increased time that development and support will need to do their jobs is important; but, so is weighing the time that will be saved by constraining your data! Often, it’s smart to look at what data is being constrained, is it something that can’t easily be rebuilt or restored from backup, or is it a highly touched type 2 dimension with critical information? Do your homework.
Simply put: test and question everything. RI is an important decision and researching and testing your solutions, even as you continue to implement, is key.