I recently needed to retrieve all the links from a particular subreddit, consolidate them, and keep a running list of new links as they get added. This article is about how to scrape links from Reddit comments, as opposed to Reddit posts. I won’t go into the detail of storing the links in a database; instead, I want to focus on the meat of the Python script:
How to scrape links from Reddit Comments?
Python script used to scrape links from subreddit comments.
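As a rough illustration of the link-collecting idea (not the script from the article), here is a minimal sketch that pulls URLs out of comment text that has already been fetched and keeps only the ones not seen before. The `extract_links` helper, the regular expression, and the sample comments are all assumptions made for this example; fetching the comments from Reddit is left out entirely.

```python
import re

# Hypothetical heuristic pattern: a URL runs until whitespace or a closing
# bracket/quote. This is an assumption, not the article's actual logic.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def extract_links(comment_bodies):
    """Return unique links, in first-seen order, from an iterable of comment texts."""
    seen = set()
    links = []
    for body in comment_bodies:
        for url in URL_PATTERN.findall(body):
            url = url.rstrip(".,;:!?")  # drop trailing sentence punctuation
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links


# Made-up comment bodies standing in for scraped data.
comments = [
    "Check out https://example.com/a and [this](https://example.org/b).",
    "Already posted: https://example.com/a",
]
new_links = extract_links(comments)
print(new_links)
```

The `seen` set is what keeps the list a running list of new links only; persisting it between runs is the database part that the article deliberately skips.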
Teradata’s Qualify clause which, at the time of writing this and probably for the foreseeable future, only exists in Teradata, and is spectacularly awesome! Also, so are Koalas.
Meme of a koala poking fun at Teradata’s Qualify By clause.
I recently ran into an interesting problem that I’d like to share and show how I resolved it. The solution involves a catch-all join to a lookup dimension table.
ERD Diagram of wildcard lookup status table.
Imagine having many employees that work in many departments. Each department has its own way of determining an employee’s status: some departments use the status code given in the source system, others rely solely on the department the employee is from, and others use a combination of both! Oh yeah, the fun bit: this status logic can change…
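To make the catch-all join idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, the `'*'` wildcard marker, and the sample rows are all assumptions for illustration, since the excerpt does not show the actual schema. It also assumes the lookup rows never overlap, so each employee matches exactly one rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (emp_id INTEGER, dept TEXT, src_status TEXT);
CREATE TABLE status_lookup (dept TEXT, src_status TEXT, final_status TEXT);

INSERT INTO employee VALUES
  (1, 'SALES', 'A'), (2, 'SALES', 'T'), (3, 'TEMPS', 'A'),
  (4, 'HR', 'L'), (5, 'OPS', 'X');

-- '*' is a hypothetical wildcard meaning "matches any value".
INSERT INTO status_lookup VALUES
  ('SALES', 'A', 'Active'),      -- department + source code
  ('SALES', 'T', 'Terminated'),
  ('HR',    'L', 'On Leave'),
  ('TEMPS', '*', 'Contractor'),  -- department only
  ('*',     'X', 'Excluded');    -- source code only
""")

# Each side of the join accepts either an exact match or the wildcard.
rows = conn.execute("""
SELECT e.emp_id, l.final_status
FROM employee e
JOIN status_lookup l
  ON l.dept IN (e.dept, '*')
 AND l.src_status IN (e.src_status, '*')
ORDER BY e.emp_id
""").fetchall()
print(rows)
```

Because the rules live in a table rather than in the query, changing the status logic means editing lookup rows instead of rewriting SQL. If rules could overlap, some tie-breaking (for example, preferring the most specific row) would be needed, which this sketch does not attempt.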
Before! The comma separator before field name is always preferred. There, that was easy.
You’ve come here to either win an argument with a coworker — in which case I hope you’re here to find proof for having the comma separator before field names, or you’re doing it wrong — or you’re here to learn. In either case, the comma comes before field names. So, allow me to justify when and, more importantly, why I use one variation over the other:
Example of both comma separators
Example of a comma before & after the field name.
SELECT
    TBL.FOO, --<---- After
    TBL.BAR
FROM DUAL;
SELECT
    TBL.FOO
    ,TBL.BAR --<----- Before
FROM DUAL;
You’ll notice that these two queries are identical, except, of course, for the placement of the delimiting comma between the fields in the SELECT clause.
I’m aware that many readers of this site likely already know how to write SQL that generates SQL, or SQL generators, but it’s not really much of a data blog if I don’t mention it. It’s also worth mentioning the time value this trick represents: with a few minutes of work, you can generate thousands of lines of code that would otherwise take hours to write. Let’s chat about one of the first things I learned about SQL which, no doubt, blew my mind:
Generate SQL with SQL.
Generate SQL with SQL via database commands.
Or, to put it another way, generating SQL statements via a single SQL command. This is really cool stuff if you haven’t seen it before, so keep reading!
Let’s say you wanted to find the number of badges for each type of badge in a table of badges. This is quite easy with a simple query, but let’s see how we can do it by writing SQL with SQL:
Writing SQL with SQL example INPUT.
SELECT 'SELECT CAST('''
    + BADGES.NAME
    + ''' AS VARCHAR(50)), COUNT(1) FROM BADGES WHERE BADGES.NAME = '''
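For readers who want to try the idea end to end, here is a small self-contained sketch using Python's built-in sqlite3 module. It is not a completion of the query above: SQLite concatenates strings with `||` rather than `+`, and the `badges` table contents here are made up for the example. It also assumes badge names contain no single quotes, since the generated statements embed the names directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE badges (name TEXT);
INSERT INTO badges VALUES
  ('Teacher'), ('Student'), ('Teacher'), ('Editor'), ('Teacher');
""")

# The generator query: each output row is itself a SQL statement (as text).
generator = """
SELECT 'SELECT ''' || name || ''' AS badge, COUNT(1) FROM badges WHERE name = ''' || name || ''''
FROM (SELECT DISTINCT name FROM badges)
ORDER BY name
"""

# Step 1: run the generator to get one statement per badge name.
statements = [row[0] for row in conn.execute(generator)]
for stmt in statements:
    print(stmt)

# Step 2: run each generated statement.
counts = [conn.execute(stmt).fetchone() for stmt in statements]
print(counts)
```

The payoff grows with the number of rows driving the generator: three badge names produce three statements here, but the same query would produce thousands just as easily.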
I write hundreds of lines of SQL every day, and that’s just for my day job. This repetition has allowed me to pick up some exceptional tricks: simple SQL tricks that help me write cleaner code, code that is easier to troubleshoot, produces more accurate data sets, and is easier to manage as the query grows. So, here are three SQL tricks that I consistently use in every query I write:
SQL Tricks and Tips to use in everyday query writing.
SQL Tricks WHERE 1 = 1
This is one of my favorite SQL tricks, and it always confuses people when they look at my queries. I’ll reiterate this later, but this is NOT a production-ready trick; it is merely a tool to make troubleshooting easier.
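The excerpt does not show the query itself, so here is a sketch of the commonly seen form of this trick, run through Python's built-in sqlite3 module with a made-up `users` table. The assumption is that `WHERE 1 = 1` is there so every real condition starts with `AND`, which lets any one of them be commented out without touching the rest of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (name TEXT, age INTEGER, country TEXT);
INSERT INTO users VALUES
  ('Ann', 34, 'CA'), ('Bob', 19, 'CA'), ('Cy', 52, 'CA'), ('Dee', 40, 'US');
""")

# Every real predicate begins with AND, so each one sits on its own line.
full_query = """
SELECT name FROM users
WHERE 1 = 1
  AND age > 30
  AND country = 'CA'
ORDER BY name
"""

# While troubleshooting, a predicate is switched off with a single comment
# marker; nothing else in the WHERE clause needs to be rearranged.
debug_query = """
SELECT name FROM users
WHERE 1 = 1
  -- AND age > 30
  AND country = 'CA'
ORDER BY name
"""

full_rows = conn.execute(full_query).fetchall()
debug_rows = conn.execute(debug_query).fetchall()
print(full_rows)
print(debug_rows)
```

Without the `1 = 1` placeholder, commenting out the first condition would leave a dangling `AND` and break the query.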
Let’s chat about DISTINCT vs GROUP BY: I often see the misuse of GROUP BY with no aggregates when a DISTINCT would suffice. However, I more often see the misuse of a DISTINCT when a GROUP BY is more practical. I wanted to talk about some, maybe not-so-obvious, reasons to use one over the other. I won’t go into the functional differences, but rather the practical ones.
But first, is there really a technological difference between using DISTINCT vs a GROUP BY (sans aggregates, of course)? The quick answer is that it depends on the system; the optimizer can differ from one DBMS to the next. However, in my tests, the likely more accurate answer is simply: no.
Distinct vs Group By (sans aggregates)
SELECT DISTINCT AGE FROM USERS;
SELECT AGE FROM USERS GROUP BY AGE;
These two queries not only produce the exact same result set, they produce the exact same query plans too. To see for yourself, take a look here. (Tested on SQL Server, Oracle, and Teradata)
Query plan showing the similarities between a distinct and a group by clause.
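The result-set half of that claim is easy to check locally. The sketch below uses Python's built-in sqlite3 module with a made-up `users` table; SQLite is not one of the systems tested above, and this compares only the rows returned, not the query plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (name TEXT, age INTEGER);
INSERT INTO users VALUES
  ('Ann', 34), ('Bob', 19), ('Cy', 34), ('Dee', 19), ('Eli', 52);
""")

distinct_rows = conn.execute("SELECT DISTINCT age FROM users").fetchall()
group_rows = conn.execute("SELECT age FROM users GROUP BY age").fetchall()

# Neither query has an ORDER BY, so sort before comparing.
same = sorted(distinct_rows) == sorted(group_rows)
print(sorted(distinct_rows), sorted(group_rows), same)
```

Sorting before the comparison matters because, without an ORDER BY, neither form guarantees the order in which rows come back.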
This blog will be my virtual valet of knowledge that oozes out from time to time. I wanted a place that I could share my thoughts as my career advances and confidently build a platform to represent who I am, what I do and my thoughts on the subject of data & analytics.
Throughout my storied career in IT, I’ve made lots of mistakes. It’s those mistakes that have made me the arrogant jackass subject matter expert I am today. Working with seasoned rock stars, high-priced consultants, and everyone in between (You see what I did there?). So, it’s not a problem that you’re doing it wrong; the problem is that hopefully, I’m not.
The posts you’ll read on this site will be about all the dialogue that comes from us trying our best, failing early, and failing often.
Hope you enjoy it. Now, it’s time to fill this mug with some content!