I recently needed to retrieve all the links from a particular subreddit, consolidate them, and keep a running list of new links as they get added. This article is about how to scrape links from Reddit comments, as opposed to Reddit posts. I won't go into the details of storing the links in a database; instead, I want to focus on the meat of the Python script:
How to scrape links from Reddit Comments?

First, you need to understand that Reddit allows you to convert any of its pages into JSON output. You can do this by simply adding ".json" to the end of any Reddit URL. Second, the JSON output returns 25 results by default and caps out at 100 results per request, which you can ask for with the limit parameter. Finally, you need a way to fetch the next 100 results, and the 100 after that, and so on. This is where the magic happens!
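To make that concrete, here is a minimal sketch (the user agent string is just a placeholder, and I'm using the SQL subreddit as an example) that asks for a single page of comments as JSON with the limit bumped up to 100:

import requests

# Appending ".json" turns the page into a JSON listing; "limit=100" raises
# the default page size of 25 up to the maximum of 100 items per request.
url = 'https://www.reddit.com/r/SQL/comments.json?limit=100'
headers = {'user-agent': 'Python.Find_SQL:v0.1 (by /u/your_username)'}  # placeholder

payload = requests.get(url, headers=headers).json()
print(len(payload['data']['children']))  # up to 100 comment objects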
I should mention there is a really cool Python library called PRAW (Python Reddit API Wrapper) that makes using the Reddit API super simple and makes scraping links off pages pretty easy too, but I wanted a simpler solution that didn't require any bulky frameworks or API calls.
Looping over Reddit’s 100 post JSON limit
Let’s start by looking at how to retrieve an entire subreddit’s comments, pass the proper connection information, and loop through each page without getting blocked:
while after is not None:
    #Read reddit's json file
    url = 'http://www.reddit.com/r/' + subreddit + '/comments.json?limit=100&after=' + after
    headers = {'user-agent': user_agent}
    json = requests.get(url, headers=headers).json()

    #Set json starting point in URL
    after = json['data']['after']

    #Wait between 10 and 15 seconds to prevent reddit from blocking the script
    time.sleep(10 + randint(0,5))
Within the JSON schema, which I'll describe in more detail below, there is a field called "after" that holds the unique identifier of the last item returned. You pass that value back to Reddit to tell it where to start the next set of 100 posts. If you don't pass an "after" parameter, or leave it blank/null, Reddit ignores it and delivers the first 100 posts.
So the idea is: grab the first 100, figure out what the last identifier is, and keep looping until the identifier comes back as null, or in Python-speak, None. When the "after" variable returns None, the loop will break.
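Put another way, the URLs the loop requests look something like this (next_page_url is just a hypothetical helper to show the pattern, and the token shown is the example value from the sample JSON further down):

def next_page_url(subreddit, after_token):
    # Hypothetical helper: build the URL for one page of up to 100 comments.
    # An empty token fetches the newest comments; a token like 't1_d6xrooj'
    # starts the listing immediately after that comment.
    return ('http://www.reddit.com/r/' + subreddit +
            '/comments.json?limit=100&after=' + after_token)

print(next_page_url('SQL', ''))            # first page
print(next_page_url('SQL', 't1_d6xrooj'))  # the page after comment t1_d6xrooj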
This script uses Requests, an HTTP library for Python. If you're not familiar with it, it's the de facto standard for making HTTP requests in Python and well worth learning.
Additionally, notice that I have a sleep timer at the bottom of the loop. I'm not 100% certain it's necessary, but I've run a loop like this for many, many hours without being blacklisted, so I think it's a good idea 😉
How does Reddit’s comments JSON schema work?
Let's take a look at the SQL subreddit and pull the 100 most recent comments posted across that subreddit; you can see the goods for yourself at https://www.reddit.com/r/SQL/comments.json?limit=100. From that output, let's look at a single comment within the whole schema to get an idea of how the structure works:
{ "kind":"Listing", "data":{ "modhash":"", "children":[ { "kind":"t1", "data":{ "subreddit_id":"t5_2qp8q", "link_title":"[MS SQL] How worthless is it that MSSQL Management Studio only auto-recovers scripts that you've already saved?", "banned_by":null, "removal_reason":null, "link_id":"t3_4zpmdc", "link_author":"mcraamu", "likes":null, "replies":"", "user_reports":[ ], "saved":false, "id":"d6xrooj", "gilded":0, "archived":false, "stickied":false, "author":"Eldarial", "parent_id":"t3_4zpmdc", "score":8, "approved_by":null, "over_18":false, "report_reasons":null, "controversiality":0, "body":"SQL Prompt saves the most recent version of everything you've ever closed, regardless of whether or not you saved it. It's not worth free but definitely worth the money", "edited":false, "author_flair_css_class":null, "downs":0, "body_html":"<div class=\"md\"><p>SQL Prompt saves the most recent version of everything you&#39;ve ever closed, regardless of whether or not you saved it. It&#39;s not worth free but definitely worth the money</p>\n</div>", "quarantine":false, "subreddit":"SQL", "score_hidden":false, "name":"t1_d6xrooj", "created":1472261840.0, "author_flair_text":null, "link_url":"https://www.reddit.com/r/SQL/comments/4zpmdc/ms_sql_how_worthless_is_it_that_mssql_management/", "created_utc":1472233040.0, "ups":8, "mod_reports":[ ], "num_reports":null, "distinguished":null } } ], "after":"t1_d6xrooj", "before":null } } |
Notice there is a wealth of information about just a single comment made to a particular post on the r/SQL subreddit. Some particularly useful bits of data are the "body" field, the "link_title", and the "score". With those three fields alone, you can build a serious compilation of metadata.
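If that metadata is all you need, a quick sketch like this (assuming json still holds the parsed listing from the loop above) pulls those three fields out of every comment on the page:

# Assumes 'json' holds the parsed listing returned by .json() in the loop above
for child in json['data']['children']:
    comment = child['data']
    # Pair each comment's score and text with the post it was left on
    print(comment['score'], comment['link_title'], comment['body'][:80])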
Also, as mentioned earlier, notice the "after" field, which always sits at the bottom of the JSON, within the first level of the "data" nest. A keen observer will see that the "after" value is the same as the last comment's "name" value.
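You can check that yourself with a couple of lines (again assuming json holds the parsed listing):

# The listing-level 'after' token matches the 'name' of the last child
last_name = json['data']['children'][-1]['data']['name']
print(last_name == json['data']['after'])  # True whenever Reddit returns an 'after' token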
How to retrieve links from a comment's body field?
This piece is a bit tricky to explain, but the script uses a regex to locate any string of text that resembles a link, plus a bit of hacking to get to the proper nested element within the JSON output. Let's take a look:
#...
json = requests.get(url, headers=headers).json()
output = [json['data']['children']]

#Do some nasty cleansing to find the data we need
for child in output:
    for post in child:
        text = post['data']['body'].strip(' \t\n\r').replace(')', ' ').replace('(', ' ')
        text = str(re.findall('(\S+://+\S*)',text))
        text = text.replace('[', ' ').replace(']', ' ').replace("'", ' ').strip()
        if text != '':
            text = text.split(',')
            for t in text:
                t = t.strip(' \t\n\r')
                if t not in data:
                    data.append(t)
The real magic there is this bit: re.findall('(\S+://+\S*)', text). If you want to see it in action, try it in an online regex tool, which I highly recommend if you haven't used one before. There is a bit of other hackery I have to do to cleanse the data: I clean up carriage returns, white space, ticks and quote marks, and other odd things I saw when scraping the posts. You can modify this to suit your needs.
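If you want a quick feel for what that pattern grabs, here is a tiny example run against a made-up comment body (not a real Reddit comment):

import re

# Made-up comment body, just for illustration
body = 'Check the docs (https://docs.python.org/3/) and ftp://example.com/files for details.'

# Mirror the script: strip the parentheses first, then grab anything
# shaped like 'scheme://something'
cleaned = body.replace(')', ' ').replace('(', ' ')
print(re.findall('(\S+://+\S*)', cleaned))
# ['https://docs.python.org/3/', 'ftp://example.com/files']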
Now, let’s see the script in its entirety:
from random import randint
import requests, json, re, time

#Parameters
subreddit = 'SQL'   #Subreddit of choice
reddituser = 'foo'  #Your Reddit username

user_agent = 'Python.Find_' + subreddit + ':v0.1 (by /u/' + reddituser + ')'
after = ''
data = list()

while after is not None:
    #Read reddit's json file
    url = 'http://www.reddit.com/r/' + subreddit + '/comments.json?limit=100&after=' + after
    headers = {'user-agent': user_agent}
    json = requests.get(url, headers=headers).json()
    output = [json['data']['children']]

    #Do some nasty cleansing to find the data we need
    for child in output:
        for post in child:
            text = post['data']['body'].strip(' \t\n\r').replace(')', ' ').replace('(', ' ')
            text = str(re.findall('(\S+://+\S*)',text))
            text = text.replace('[', ' ').replace(']', ' ').replace("'", ' ').strip()
            if text != '':
                text = text.split(',')
                for t in text:
                    t = t.strip(' \t\n\r')
                    if t not in data:
                        data.append(t)

    #Set json starting point in URL
    after = json['data']['after']

    #Wait between 10 and 15 seconds to prevent reddit from blocking the script
    time.sleep(10 + randint(0,5))

#Print the data
for d in data:
    print(d)
Hope you find this helpful. If you have any comments or suggestions, leave a comment below!