How To Scrape Links From Reddit Comments

I recently had the need to retrieve all the links from a particular subreddit, consolidate them, and keep a running list of new links as they get added. Knowing how to scrape links from Reddit comments, as opposed to Reddit posts, is what this article is about. I won’t go into the details of storing the links in a database; instead, I want to focus on the meat of the Python script.

How to scrape links from Reddit comments?


First, you need to understand that Reddit lets you view any of its pages as JSON output. You can do this by simply adding “.json” to the end of any Reddit URL. Secondly, a JSON listing returns at most 100 results per request, but by default Reddit only gives you 25; you raise the fetch size with a “limit” parameter. Finally, you need the ability to fetch the next 100 results, and the 100 after that, and so on. This is where the magic happens!
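For example, using the r/SQL subreddit (which I’ll come back to later):

```
https://www.reddit.com/r/SQL/comments.json            # default: 25 results
https://www.reddit.com/r/SQL/comments.json?limit=100  # maximum: 100 results
```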

I should mention there is a really cool Python library called PRAW (Python Reddit API Wrapper) that makes using the Reddit API super simple and makes scraping links off pages pretty easy too – but I wanted a simpler solution that didn’t require any bulky frameworks or API calls.

Looping over Reddit’s 100-post JSON limit

Let’s start by looking at how to retrieve an entire subreddit’s comments, pass the proper connection information, and loop through each page without getting blocked:
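Here’s a minimal sketch of that loop (the function name, the User-Agent string, and the two-second delay are my own choices, not anything Reddit mandates):

```python
import time

import requests

# Reddit rate-limits clients with generic User-Agents, so send a descriptive one.
HEADERS = {"User-Agent": "link-scraper/0.1 (personal project)"}

def fetch_all_comments(subreddit):
    """Yield every comment in a subreddit's comment listing, 100 at a time."""
    after = None
    while True:
        params = {"limit": 100}
        if after:
            params["after"] = after
        response = requests.get(
            f"https://www.reddit.com/r/{subreddit}/comments.json",
            headers=HEADERS,
            params=params,
        )
        response.raise_for_status()
        data = response.json()["data"]

        for child in data["children"]:
            yield child["data"]

        after = data["after"]  # unique id of the last comment, or None on the last page
        if after is None:
            break
        time.sleep(2)  # pause between requests so Reddit doesn't block us
```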

Within the JSON schema, which I’ll describe in more detail below, there is a field called “after” which tells Reddit what the last post’s unique identifier is, and therefore where to start retrieving the next set of 100 posts. If you don’t define an “after”, or just leave it blank/null, Reddit ignores the parameter and delivers the first 100 posts.

So the idea is: you grab the first 100, figure out what the last identifier is, and loop over that until the identifier comes back null – or in Python-speak, None. When the “after” value returns None, the loop breaks.

This script uses Requests, an HTTP library for Python. It isn’t part of the standard library, but it’s the de facto standard for HTTP in Python and well worth familiarizing yourself with (a quick “pip install requests” gets you started).

Additionally, notice that I have a sleep timer at the bottom of the loop. I’m not 100% certain it’s necessary, but I’ve run a loop like this for many, many hours without being blacklisted. Therefore, I think it’s a good idea 😉

How does Reddit’s comments JSON schema work?

Let’s take a look at the SQL subreddit and view the first 100 comments across every post in that subreddit: fetch https://www.reddit.com/r/SQL/comments.json?limit=100 to see the goods. From that output, let’s look at a single comment within the whole schema to get an idea of how the structure works:
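Here’s a heavily trimmed sketch of that structure (the values are made up for illustration and most fields are elided, but the field names are real):

```json
{
  "kind": "Listing",
  "data": {
    "children": [
      {
        "kind": "t1",
        "data": {
          "name": "t1_abc123",
          "body": "Try the docs at https://example.com/sql-tips ...",
          "link_title": "How do I optimize this query?",
          "score": 5
        }
      }
    ],
    "after": "t1_abc123",
    "before": null
  }
}
```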

Notice there is a wealth of information about just a single comment made to a particular post on the r/SQL subreddit. Some particularly good bits of data are the “body” field, the “link_title”, and the “score”. With those three fields, you can build a serious compilation of metadata.

Also, as mentioned earlier, notice the “after” field, which always sits at the bottom of the JSON, at the top level of the ‘data’ nest. A keen observer will see that the “after” value is the same as the last comment’s “name” value.

How to retrieve links from a comment’s “body” field?

This piece is a bit tricky to explain, but the script uses a regex to locate strings of text that resemble links, plus a bit of hacking to get to the proper nested element within the JSON output. Let’s take a look:
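Here’s a sketch of that extraction (the cleanup characters are just the ones I happened to run into; adjust to taste):

```python
import re

# Anything that looks like a URL: non-whitespace, then '://', then more non-whitespace.
LINK_PATTERN = re.compile(r'(\S+://+\S*)')

def extract_links(comment):
    """Pull link-like strings out of a comment's 'body' and lightly cleanse them."""
    links = []
    for match in LINK_PATTERN.findall(comment.get("body", "")):
        # Strip carriage returns, surrounding whitespace, ticks and quote marks,
        # and trailing punctuation that tends to cling to scraped links.
        cleaned = match.replace("\r", "").replace("\n", "")
        cleaned = cleaned.strip().strip('`\'"').rstrip(').,]')
        if cleaned:
            links.append(cleaned)
    return links
```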

The real magic there is this bit: re.findall('(\S+://+\S*)', text). If you want to see the pattern in action, paste it into an online RegEx tester – a class of tool I highly recommend if you don’t already know about them. There is a bit of other hackery to cleanse the data: I clean up carriage returns, white space, ticks and quote marks, and other odd things I saw when scraping the posts. You can modify this to suit your needs.

Now, let’s see the script in its entirety:
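This is a minimal end-to-end sketch using the same choices as above (User-Agent string, two-second delay, cleanup characters); point it at your subreddit of interest:

```python
import re
import time

import requests

# Reddit rate-limits clients with generic User-Agents, so send a descriptive one.
HEADERS = {"User-Agent": "link-scraper/0.1 (personal project)"}

# Anything that looks like a URL: non-whitespace, then '://', then more non-whitespace.
LINK_PATTERN = re.compile(r'(\S+://+\S*)')

def scrape_subreddit_links(subreddit):
    """Collect every unique link found in a subreddit's comments, 100 per request."""
    links = []
    after = None
    while True:
        params = {"limit": 100}
        if after:
            params["after"] = after
        response = requests.get(
            f"https://www.reddit.com/r/{subreddit}/comments.json",
            headers=HEADERS,
            params=params,
        )
        response.raise_for_status()
        data = response.json()["data"]

        for child in data["children"]:
            body = child["data"].get("body", "")
            for match in LINK_PATTERN.findall(body):
                cleaned = match.strip().strip('`\'"').rstrip(').,]')
                if cleaned and cleaned not in links:
                    links.append(cleaned)

        after = data["after"]  # unique id of the last comment, or None on the last page
        if after is None:
            break
        time.sleep(2)  # pause between requests so Reddit doesn't block us
    return links

if __name__ == "__main__":
    for link in scrape_subreddit_links("SQL"):
        print(link)
```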


Hope you find this helpful. If you have any comments or suggestions, leave a comment below!