How To Scrape Links From Reddit Comments

I recently needed to retrieve all the links from a particular subreddit, consolidate them, and keep a running list of new links as they get added. This article is about how to scrape links from Reddit comments, as opposed to Reddit posts. I won’t go into the detail of storing the links in a database; instead, I want to focus on the meat of the Python script:

How to scrape links from Reddit Comments?

Python script used to scrape links from subreddit comments.
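
As a rough illustration of the link-extraction step only, here is a minimal sketch. It assumes the comment bodies have already been fetched as plain strings (fetching them from Reddit is out of scope here), and the names `URL_PATTERN` and `extract_links` are hypothetical, not taken from the original script. The `seen` set stands in for the running list of links already collected.

```python
import re

# Hypothetical pattern: an http(s) URL runs until whitespace, a quote,
# or a closing bracket/parenthesis (as used in Markdown links).
URL_PATTERN = re.compile(r"https?://[^\s\)\]>\"']+")


def extract_links(comment_bodies, seen=None):
    """Return links not already in `seen`, in first-seen order."""
    seen = set() if seen is None else seen
    new_links = []
    for body in comment_bodies:
        for url in URL_PATTERN.findall(body):
            url = url.rstrip(".,;:!?")  # drop trailing sentence punctuation
            if url not in seen:
                seen.add(url)
                new_links.append(url)
    return new_links


# Made-up comment text for illustration.
comments = [
    "Check out [this guide](https://example.com/guide) and https://example.org/tool.",
    "Already posted: https://example.com/guide",
]
links = extract_links(comments)
print(links)
```

Passing the same `seen` set on later runs means only links that have not appeared before are returned.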

Continue reading →

Catch-All Join To A Lookup Dimension

I recently ran into an interesting problem that I’d like to share and show how I resolved it. The solution involves a catch-all join to a lookup dimension table.

ERD Diagram of wildcard lookup status table.

Imagine having many employees who work in many departments. Each department has its own way of determining an employee’s status: some departments use the status code given in the source system, others rely solely on the department the employee is from, and others use a combination of both! Oh yeah, the fun bit: this status logic can change…
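
One way such a catch-all join could look is sketched below, using SQLite through Python. Everything in it is an assumption for illustration: the table and column names, the `'*'` wildcard marker, and the rule that a department match outranks a status match are not taken from the original design. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (emp_id INTEGER, dept TEXT, src_status TEXT);
CREATE TABLE status_lookup (dept TEXT, src_status TEXT, final_status TEXT);

INSERT INTO employee VALUES
  (1, 'SALES', 'A'), (2, 'SALES', 'T'), (3, 'HR', 'A'), (4, 'IT', 'T');

-- '*' is a hypothetical catch-all marker meaning "any value"
INSERT INTO status_lookup VALUES
  ('*',     'A', 'Active'),      -- rule keyed on source status only
  ('*',     'T', 'Terminated'),
  ('HR',    '*', 'Internal'),    -- rule keyed on department only
  ('SALES', 'T', 'On Leave');    -- rule keyed on both
""")

query = """
SELECT emp_id, final_status
  FROM (
        SELECT e.emp_id
             , l.final_status
             , ROW_NUMBER() OVER (
                 PARTITION BY e.emp_id
                 -- most specific rule wins; department outranks status (assumed)
                 ORDER BY (l.dept <> '*') * 2 + (l.src_status <> '*') DESC
               ) AS rn
          FROM employee e
          JOIN status_lookup l
            ON l.dept IN (e.dept, '*')
           AND l.src_status IN (e.src_status, '*')
       )
 WHERE rn = 1
 ORDER BY emp_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because the rules live in rows rather than in the query, a change in a department’s status logic becomes an update to the lookup table instead of a code change.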

Continue reading →

Comma Separator Before Field Name Or After?

Before! The comma separator before field name is always preferred. There, that was easy.

You’ve come here either to win an argument with a coworker (in which case I hope you’re here to find proof for putting the comma separator before field names, or you’re doing it wrong) or to learn. In either case, the comma comes before field names. So, allow me to justify when and, more importantly, why I use one variation over the other:

Example of both comma separators

You’ll notice that the two queries are identical except, of course, for the placement of the delimiting comma between the fields in the select clause.
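
For readers without the image, the two styles can be sketched as follows. The `employee` table and its columns are hypothetical, and SQLite is used only to confirm that both layouts return the same rows.

```python
import sqlite3

# Comma before each field name
leading_comma = """
SELECT emp_id
     , first_name
     , last_name
  FROM employee
"""

# Comma after each field name
trailing_comma = """
SELECT emp_id,
       first_name,
       last_name
  FROM employee
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO employee VALUES (1, 'Ada', 'Lovelace')")

leading_rows = conn.execute(leading_comma).fetchall()
trailing_rows = conn.execute(trailing_comma).fetchall()
print(leading_rows == trailing_rows)
```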

There are two schools of thought here:
Continue reading →

Generate SQL with SQL?

I’m aware that many readers of this site likely already know how to write SQL that generates SQL (SQL generators), but it’s not really much of a data blog if I don’t mention it. It’s also worth mentioning the time this trick saves: in a few minutes, you can generate thousands of lines of code that would otherwise take hours to write. Let’s chat about one of the first things I learned about SQL, which, no doubt, blew my mind:

Generate SQL with SQL.

Generate SQL with SQL via database commands.

Or, to put it another way, generating SQL statements via a single SQL command. Really cool stuff if you haven’t seen it before, so keep reading!

Let’s say you wanted to find the number of badges for each type of badge in a table of badges. This is quite easy with a simple query, but let’s see how we can do it by writing SQL with SQL:
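
A minimal sketch of the idea, run against SQLite from Python: the `badges` table, its columns, and the sample rows are made up, and `quote()` is SQLite’s built-in for producing a safely quoted string literal (other databases would need their own quoting). The generator query returns one SQL statement per badge type, and each generated statement is then executed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE badges (user_id INTEGER, name TEXT);
INSERT INTO badges VALUES
  (1, 'Teacher'), (2, 'Teacher'), (3, 'Student'), (4, 'Editor'), (5, 'Teacher');
""")

# SQL that writes SQL: each output row is itself a complete SELECT statement.
generator = """
SELECT 'SELECT ' || quote(name) || ' AS badge, COUNT(*) AS n'
    || ' FROM badges WHERE name = ' || quote(name)
  FROM (SELECT DISTINCT name FROM badges)
 ORDER BY name
"""
generated = [row[0] for row in conn.execute(generator)]
for statement in generated:
    print(statement)

# Run each generated statement to get the count per badge type.
counts = [conn.execute(statement).fetchone() for statement in generated]
print(counts)
```

For this particular question a plain `GROUP BY` is simpler. The pattern pays off when the generated statements differ in ways a single query cannot express, such as one statement per table or per column.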

Continue reading →

3 SQL Tricks To Use When Query Building

I write hundreds of lines of SQL every day, and that’s just for my day job. This repetition has led me to some exceptional tricks: simple SQL tricks that help me write cleaner code that is easier to troubleshoot, produces more accurate data sets, and is easier to manage as the query grows. So, here are three SQL tricks that I consistently use in every query I write:

SQL tricks and tips to use in everyday query writing.

SQL Tricks: WHERE 1 = 1

This is one of my favorite SQL tricks, and it always confuses people who look at my queries. I’ll reiterate this later, but this is NOT a production-ready trick; it is merely a tool to make troubleshooting easier.

How does the WHERE 1=1 trick work?
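
The excerpt stops before the explanation, so what follows is the commonly given rationale rather than the author’s own wording: because `1 = 1` is always true, every real predicate can start with `AND`, and any one of them, including the first, can be commented out without breaking the query. The `orders` table and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, region TEXT, status TEXT);
INSERT INTO orders VALUES (1, 'EAST', 'OPEN'), (2, 'EAST', 'CLOSED'), (3, 'WEST', 'OPEN');
""")

all_filters = """
SELECT order_id
  FROM orders
 WHERE 1 = 1
   AND region = 'EAST'
   AND status = 'OPEN'
 ORDER BY order_id
"""

# While troubleshooting, the first filter is switched off with a comment.
# Without the 1 = 1 placeholder, the next line would have to be edited too.
first_filter_off = """
SELECT order_id
  FROM orders
 WHERE 1 = 1
--   AND region = 'EAST'
   AND status = 'OPEN'
 ORDER BY order_id
"""

rows_all = conn.execute(all_filters).fetchall()
rows_relaxed = conn.execute(first_filter_off).fetchall()
print(rows_all, rows_relaxed)
```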
Continue reading →

Distinct Vs Group By: Which is really better?

Let’s chat about DISTINCT vs. GROUP BY: I often see GROUP BY misused with no aggregates when a DISTINCT would suffice. However, I more often see DISTINCT misused when a GROUP BY is more practical. I wanted to talk about some, maybe not-so-obvious, reasons to use one over the other. I won’t go into the functional differences so much as the practical ones.

But first, is there really a technological difference between using DISTINCT and GROUP BY (sans aggregates, of course)? The quick answer is that it depends on the system; the optimizers in the many DBMSs out there could behave differently. However, based on my tests, the more accurate answer is likely, simply, no.

These two queries not only produce the exact same result set, they produce the exact same query plans too. To see for yourself, take a look here. (Tested on SQL Server, Oracle, and Teradata)

Query plan showing the similarities between a distinct and a group by clause.
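
The result-set half of that claim is easy to reproduce. The sketch below uses SQLite as a stand-in with a made-up `employee` table; it says nothing about the query plans on SQL Server, Oracle, or Teradata, which would need to be checked on those systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (emp_id INTEGER, dept TEXT, status TEXT);
INSERT INTO employee VALUES
  (1, 'SALES', 'A'), (2, 'SALES', 'A'), (3, 'SALES', 'T'), (4, 'HR', 'A'), (5, 'HR', 'A');
""")

distinct_rows = conn.execute("""
SELECT DISTINCT dept, status
  FROM employee
 ORDER BY dept, status
""").fetchall()

# GROUP BY with no aggregates returns the same unique combinations.
group_by_rows = conn.execute("""
SELECT dept, status
  FROM employee
 GROUP BY dept, status
 ORDER BY dept, status
""").fetchall()

print(distinct_rows == group_by_rows)
```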

Continue reading →

You’re Doing It Wrong. Hopefully, I’m not.

This blog will be my virtual valet of knowledge that oozes out from time to time. I wanted a place where I could share my thoughts as my career advances and confidently build a platform to represent who I am, what I do, and my thoughts on the subject of data & analytics.

Throughout my storied career in IT, I’ve made lots of mistakes. It’s those mistakes that have made me the arrogant jackass subject matter expert I am today. I’ve worked with seasoned rock stars, high-priced consultants, and everyone in between (you see what I did there?). So, it’s not a problem that you’re doing it wrong; the problem is that hopefully, I’m not.

The posts you’ll read on this site will be about all the dialogue that comes from us trying our best, failing early, and failing often.

Hope you enjoy it. Now, it’s time to fill this mug with some content!