Referential Integrity in the Data Warehouse is a controversial topic amongst BI Professionals; you’re either all for it or all against it. While keeping the focus on RI, solely with regard to the Data Warehouse, I want to discuss some pros and cons to help you make a better, more informed decision that will have long-lasting impacts to your warehouse, development lifecycle, and production maintenance.
You need to understand that referential integrity is a decision, not a standard. Blindly implementing RI because you’re “supposed to” is a presumptuous, neglectful response to a complex problem. I think that everyone will agree that leaving constraints at the application layer is an apprentice mistake, nevertheless, when it comes to data it may be our best choice. Yet, still, having table-driven constraints is always preferred — if it makes sense!
When does it make sense to have Referential Integrity?
Using the Row NumberWindow Function to flag records if that row is the first occurrence in a series is a performant way to extract more value from your data. By using a Partition By clause we can group chunks of data together while ordering them to figure out which is the first in that series of data. Row Number, with the addition of the Over clause, allows us to achieve this without the use of a subquery by simply wrapping the function in a case statement.
An example of how to find the first occurrence in a series
I recently had the need to retrieve all the links from a particular subreddit, consolidate them, and keep a running list of new links that get added. Knowing how to scrape links from Reddit comments versus a Reddit’s posting is what this article is about. While I won’t go into the detail of storing the links into a database, I want to instead focus on the meat of the python script:
I recently ran into an interesting problem that I’d like to share and show how I resolved it. The solution involves a catch-all join to a lookup dimension table.
Imagine having many employees that work in many departments. Each department has their own way of determining the employee’s status; Some departments use the status code that was given in the source system, other departments rely solely on the department they’re from and others use a combination of both! Oh yeah, the fun bit, this status logic can change…
Before! The comma separator before field name is always preferred. There, that was easy.
You’ve come here to either win an argument with a coworker — in which case I hope you’re here to find proof for having the comma separator before field names, or you’re doing it wrong — or you’re here to learn. In either case, the comma comes before field names. So, allow me to justify when and, more importantly, why I use one variation over the other:
Example of both comma separators
Example of a comma before & after the field name.
You’ll notice that both these queries are very much identical, with the exception of the placement of the delimiting comma between each field in the select clause of course.
I’m aware that many of my readers of this site likely already know how to write sql to generate sql, or sql generators, but it’s not really much of a data blog if I don’t mention it. It’s also important to mention the time value this trick represents, with a few minutes of time, you can generate thousands of lines of code that would otherwise take hours to write. Let’s chat about one of the first things I learned about SQL which, no doubt, blew my mind:
Generate SQL with SQL.
Or to put another way, generating SQL statements via a single SQL command. Really cool stuff if you haven’t seen this before, keep reading!
Let’s say you wanted to find the number of badges for each type of badge in a table of badges. This is quite easy with a simple query, but let’s see how we can do it by writing SQL with SQL:
Writing SQL with SQL example INPUT.
+'''AS VARCHAR(50)), COUNT(1) FROM BADGES WHERE BADGES.NAME = '''
I write hundreds of lines of SQL every day, and that’s just for my day job. This repetition has allowed me to realize some exceptional tricks. Simple SQL tricks which help me write cleaner code, code that is easier to troubleshoot, guarantees more accurate data sets and is easier to manage as the query grows. So, here are three SQL tricks that I consistently use in every query I write:
SQL Tricks WHERE 1 = 1
This is one of my favorite SQL tricks that always causes people confusion when looking at my queries. I’ll reiterate this again later, but this is NOT a production ready trick, this is merely a tool to help make troubleshooting easier.