How F5Bot Slurps All of Reddit

By Lewis Van Winkle | July 30, 2018

In this guest post, Lewis Van Winkle talks about F5Bot, a free service that emails you when selected keywords are mentioned on Reddit, Hacker News, or Lobsters. He explains in detail how F5Bot is able to process millions of comments and posts from Reddit every day on a single VPS. You can check out more of Lewis Van Winkle's writing at codeplea.com, along with his open source contributions.

I run a free service, F5Bot. The basic premise is that you enter a few keywords to monitor, and it'll email you whenever those keywords show up on Reddit or Hacker News. I'm hoping to add a few more sites soon.

This post is going to be about how I built the service. Specifically, we’re going to focus on scraping every post and comment from Reddit in real time.

F5Bot is written in PHP. It's programmed in a very boring, straight-forward manner. I'm not doing anything special, really, but it does manage to scan every Reddit post and comment in real-time. And it does it all on a tiny low-end VPS. Sometimes people are surprised when I tell them this. "Reddit has huge volume," they say, "Reddit runs a room-full of super computers, and they still have downtime like every day! How do you do it?" Well, I'm going to tell you.

Reddit gets about 500,000 new posts and 4,000,000 new comments every day. Combined, that's about 50 per second. Posts don't come in at a constant rate, though, so we had better be able to handle a few hundred per second. You might think PHP is slow, but it can absolutely handle this. I mean, how long does it take your computer to search through a few hundred files? Really, the only catch is that we need to write the program carefully. We can't waste resources, but if we're careful we'll be able to keep up with the traffic.
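To sanity-check that estimate, the arithmetic is just:

```php
<?php
// Rough daily volumes quoted above.
$posts_per_day    = 500000;
$comments_per_day = 4000000;

// Average combined rate: items the scraper must handle each second.
$rate = ($posts_per_day + $comments_per_day) / (24 * 60 * 60);

printf("Average rate: %.1f items/second\n", $rate); // prints "Average rate: 52.1 items/second"
```

Peaks run well above that average, which is why a few hundred per second is the real target.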

So let’s get started!

Reddit’s Listing API

Reddit provides a nice JSON API, and there is some good documentation for it. I'm pretty sure their site uses it internally to render the HTML pages for visitors. So we'll just use that to grab the most recent posts. It lives at https://www.reddit.com/r/all/new/.json, and I recommend you open that up in your browser to follow along.

In PHP we'll just grab it with a simple cURL request and then decode it with json_decode(). (As a side note, I'm going to completely skip over error handling because it's so boring. I mean, it's the most important part of running an actual real-live server, but it's boring, so I won't talk about it anymore here.)

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.reddit.com/r/all/new/.json");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'testbot2000'); /* Don't forget to name yourself! */
$posts = curl_exec($ch);
curl_close($ch);
$posts = json_decode($posts, true);
print_r($posts);


{"kind": "Listing", "data": {"modhash": "5b1dwrfbn4a9f8c200824114ecc7110ee2baa4d95dc2c106b9", "dist": 25, "children": [
  {"kind": "t3", "data": {"approved_at_utc": null, "subreddit": "Guitar", "selftext": "I practice guitar everyday. Most everyday, I...", "user_reports": [], "saved": false, "mod_reason_title": null, "gilded": 0, "clicked": false, "title": "[DISCUSSION] Thanks to the people on here", "subreddit_name_prefixed": "r/Guitar", "hidden": false, "id": "91sed5", /* Many more key/value pairs. */ }},
  {"kind": "t3", "data": {"approved_at_utc": null, "subreddit": "gpdwin", "selftext": "What is the cheapest way to buy the unit...", "user_reports": [], "saved": false, "mod_reason_title": null, "gilded": 0, "clicked": false, "title": "Looking for advice on buying a WIN 2 in the UK", "subreddit_name_prefixed": "r/gpdwin", "hidden": false, "id": "91secm", /* Many more key/value pairs. */ }}
  /* Many more objects. */
], "after": "t3_91secm", "before": null}}

Depending on how you have PHP set up, cURL may have trouble with secure HTTPS connections. If that's the case, you can bypass certificate verification with curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false).

Now we have all the recent posts in $posts as a nice clean PHP array. This is going to be a short article; we're almost done already!


print("Downloaded " . count($posts['data']['children']) . " posts.\n");
Downloaded 25 posts.


Only 25 posts, though, and that's not enough. We can ask for up to 100 at a time by adding a limit parameter to the URL:

...
curl_setopt($ch, CURLOPT_URL, "https://www.reddit.com/r/all/new/.json?limit=100");
...
print("Downloaded " . count($posts['data']['children']) . " posts.\n");

But what if more than 100 posts come in between our requests? We'd miss some.


There's an easy solution, though. If you look closely at the JSON, you'll see that it has an after field with an ID in it. That ID marks the oldest post in this listing, so it's effectively saving our place. We can call the API again, but ask for the 100 posts that come after this post ID!

$after = $posts['data']['after'];
...
curl_setopt($ch, CURLOPT_URL, "https://www.reddit.com/r/all/new/.json?limit=100&after=$after");
...

This works well up to a point. That point is about 1,000 posts. After that it will either loop back to the beginning and start showing you posts you’ve already seen, or it will just stop returning anything. I’m not a big fan of that, because if my scraper has a little down time I’d like it to go back and grab the old posts too, but maybe you’re not worried about that.
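Put together, the paging logic looks something like the following sketch. The $fetch callable is a stand-in for the cURL-and-json_decode code above, injected so the loop itself can be tested without the network; collect_posts() is my own name, not from F5Bot's code.

```php
<?php
// Page through a Reddit listing by following the "after" token.
// $fetch($after) must return the decoded listing array; in real use it
// would wrap the cURL call shown earlier, passing $after as a URL parameter.
function collect_posts(callable $fetch, int $max_pages = 10): array {
    $all = array();
    $after = null;
    for ($page = 0; $page < $max_pages; ++$page) {
        $listing = $fetch($after);
        $children = $listing['data']['children'];
        if (count($children) == 0) {
            break; // Reddit stopped returning anything.
        }
        foreach ($children as $child) {
            $all[] = $child['data']['id'];
        }
        $after = $listing['data']['after'];
        if ($after === null) {
            break; // No more pages.
        }
    }
    return $all;
}
```

With $max_pages = 10 and limit=100 per request, this walks right up to the ~1,000-post ceiling described above and then stops.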

Now, everything I've said here about posts is also true for comments. The only difference is which listing you pull them from, and that comments come in much faster.

But we’ve got an even bigger problem. Each request takes a few seconds (or much longer if Reddit servers are loaded), and that means we can’t pull comments as quickly as they’re coming in. Remember, comments get posted at a rate of 50 per second, and it can be much more during peak traffic.

Making Multiple Simultaneous Connections

So we'll have to make a bunch of connections at once. PHP makes this easy with the curl_multi functions, but we're getting ahead of ourselves. If we're just pulling the listings, how do we know the ID to start the next request on if the current request hasn't finished?

So each Reddit thing, like a post or comment, has a unique ID. Post IDs start with t3_ and comment IDs start with t1_. The API is very inconsistent about whether it uses the prefix or not. Some parts of the API require the prefix, but some just use the ID with no prefix.
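A couple of tiny helpers (my own names, not from F5Bot's code) keep that prefix juggling in one place:

```php
<?php
// Reddit "fullnames" are a type prefix plus a base-36 ID:
// t3_ for posts (links), t1_ for comments. Some API calls want the
// prefix and some want the bare ID, so normalize in one spot.
function strip_prefix(string $fullname): string {
    return preg_replace('/^t[0-9]+_/', '', $fullname);
}
function post_fullname(string $id): string {
    return "t3_" . $id;
}
```

For example, strip_prefix('t3_91secm') gives '91secm', and post_fullname() puts the prefix back for the calls that demand it.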

The IDs themselves are encoded in base 36, so they use the ten digits 0-9 plus the twenty-six letters a-z. In PHP we can use base_convert() to convert them to decimal and do math with them.


$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.reddit.com/r/all/new/.json");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$posts = curl_exec($ch);
curl_close($ch);
$posts = json_decode($posts, true);
$id = $posts['data']['children'][0]['data']['id'];
$id = base_convert($id, 36, 10);

From there it's easy to do math on $id and then convert it back to Reddit's format with base_convert($id, 10, 36). Very handy.
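For example, stepping the sample post's ID back by 100:

```php
<?php
// Take the sample post ID from the JSON above and step it back by 100.
$id = '91secm';
$decimal = base_convert($id, 36, 10); // 547200166
$earlier = base_convert($decimal - 100, 10, 36);
print("$id - 100 = $earlier\n");      // prints "91secm - 100 = 91se9u"
```

Reddit IDs are short enough that base_convert()'s floating-point precision limit (around 2^53) is nowhere near a concern here.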

The best part is all the bandwidth you're going to save by using base 36 instead of 10! You're going to need it, too, since almost the entire Reddit API consists of crap you'll almost certainly never need. I mean, do you really want both subreddit and subreddit_name_prefixed? They're the same; one just has an "r/" in front of it. In fact, take a look at that JSON again. Basically I use the selftext, subreddit, permalink, url, and title fields. The other 95% of it is just wasted bandwidth. Actually, more than half of the reply is JSON key text, so even if you were going to use every value, you'd still spend more bandwidth on the JSON key names than on the actual data. Oh well, I guess it's still better than XML.

Anyway, now we can manipulate Reddit IDs, and it turns out that both comments and posts are assigned IDs more or less serially. They’re not strictly monotonic in the short term, but they are in the long term. (i.e. a post from an hour ago will have a lower ID than a post right now, but a post from 2 seconds ago may not)

So we can find where to start each API call by simply subtracting 100 from the starting ID. Using this method we can download a whole bunch of posts at once. Here are the latest 1,000 posts, downloaded simultaneously. This assumes you've already loaded the latest post's ID into $id (in decimal).

$mh = curl_multi_init();
$curls = array();
for ($i = 0; $i < 10; ++$i) {
    $ch = curl_init();
    $url = "https://www.reddit.com/r/all/new/.json?limit=100&after=t3_" . base_convert($id, 10, 36);
    print("$url\n");
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_multi_add_handle($mh, $ch);
    $curls[] = $ch;
    $id -= 100;
}
$running = null;
do {
    curl_multi_exec($mh, $running);
} while ($running);
echo "Finished downloading\n";
foreach ($curls as $ch) {
    $response_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $response_time = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
    $posts = curl_multi_getcontent($ch);
    $posts = json_decode($posts, true);
    print("Response: $response_code, Time: $response_time, Post count: " .
        count($posts['data']['children']) . ", Starting id: " .
        $posts['data']['children'][0]['data']['id'] . "\n");
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);

If you try to push that back past about 1,000 posts you’ll just get empty replies. Assuming that you’re ok with the 1,000 post limit, we’re done. We’ve got the API totally figured out, scraping is a solved problem, and you’re ready to finish building your bot.

Except… if you look at the post IDs you’re getting, you’ll find some funny business. Some posts are repeated. That’s not a big deal, we can ignore them, but also some posts are missing. That’s a big deal! We aren’t slurping all of Reddit if we’re missing posts.

The missing posts are caused by the IDs not being assigned totally in order. Or maybe it's just some other bug. I don't know. In any case, there's nothing we can do about it here that I know of. We'll need another solution. The 1,000 post limit was really annoying anyway. And the 1,000 comment limit meant we were going to need to scrape every 10 seconds just to avoid missing anything. That wasn't really viable long-term.


Reddit's listing APIs are terrible for trying to see every post and comment. To be fair, they probably weren't designed for that. I guess they are just designed to work with their website to render pages. No real human would want to keep clicking past 1,000 posts.



Here's how we do it. We find the starting post ID, and then we get posts individually from Reddit's /api/info.json endpoint. We append a big comma-separated list of post IDs to the end of that API URL.

So here are 2,000 posts, spread out over 20 batches of 100 that we download simultaneously. This assumes you've already got the last post ID loaded into $id in base 36.

print("Starting id: $id\n");
$urls = array();
for ($i = 0; $i < 20; ++$i) {
    $ids = array();
    for ($j = 0; $j < 100; ++$j) {
        $ids[] = "t3_" . $id;
        $id = base_convert((base_convert($id, 36, 10) - 1), 10, 36);
    }
    $urls[] = "https://www.reddit.com/api/info.json?id=" . implode(',', $ids);
}
$mh = curl_multi_init();
$curls = array();
foreach ($urls as $url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "This is actually needed!");
    print("$url\n");
    curl_multi_add_handle($mh, $ch);
    $curls[] = $ch;
}
$running = null;
do {
    curl_multi_exec($mh, $running);
} while ($running);
print("Finished downloading\n");
foreach ($curls as $ch) {
    $response_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $response_time = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
    $posts = json_decode(curl_multi_getcontent($ch), true);
    print("Response: $response_code, Time: $response_time, Post count: " .
        count($posts['data']['children']) . ", Starting id: " .
        $posts['data']['children'][0]['data']['id'] . "\n");
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);

We end up with some very long URLs, but Reddit doesn't seem to mind.

Also note that we actually need to set a user agent for the info.json API endpoint. Apparently Reddit blocks the default PHP/cURL user agent there. Setting a user agent for everything is a good idea anyway; I've only omitted it here to shorten the code samples. See Reddit's API access rules for more info.


Finished downloading
Response: 200, Time: 5.735, Post count: 96, Starting id: 90yv0w
Response: 200, Time: 7.36, Post count: 99, Starting id: 90yuy4
Response: 200, Time: 6.532, Post count: 100, Starting id: 90yuvc
Response: 200, Time: 6.547, Post count: 99, Starting id: 90yusk
Response: 200, Time: 7.344, Post count: 99, Starting id: 90yups
Response: 200, Time: 7.25, Post count: 94, Starting id: 90yun0
Response: 200, Time: 6.672, Post count: 83, Starting id: 90yuk8
Response: 200, Time: 7.469, Post count: 97, Starting id: 90yuhg
Response: 200, Time: 6.344, Post count: 91, Starting id: 90yueo
Response: 200, Time: 7.187, Post count: 97, Starting id: 90yubw
Response: 200, Time: 22.734, Post count: 96, Starting id: 90yu94
Response: 200, Time: 7.453, Post count: 93, Starting id: 90yu6c
Response: 200, Time: 7.359, Post count: 99, Starting id: 90yu3k
Response: 200, Time: 7.812, Post count: 100, Starting id: 90yu0s
Response: 200, Time: 7.703, Post count: 100, Starting id: 90yty0
Response: 200, Time: 6.375, Post count: 99, Starting id: 90ytv8
Response: 200, Time: 6.734, Post count: 97, Starting id: 90ytsg
Response: 200, Time: 7.328, Post count: 97, Starting id: 90ytpo
Response: 200, Time: 7.812, Post count: 98, Starting id: 90ytmw
Response: 200, Time: 7.812, Post count: 98, Starting id: 90ytk4

So even though we asked for posts in batches of 100, many batches come up short. There are two reasons for this. First, some posts are in private communities you don't have access to. You won't be able to see them; there's nothing you can do about that. Second, because post IDs aren't assigned in perfect order, it could be that some of the missing posts simply haven't been written yet. In that case, we just wait a few seconds and do another request with the missing IDs. Not a big deal.
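That retry step can be sketched as follows; missing_ids() is my own name, not from F5Bot's code. Given the newest ID in a batch (in decimal) and the base-36 IDs that actually came back, it returns the base-36 IDs that still need to be re-requested.

```php
<?php
// Given the first (newest, decimal) ID of a batch of $count requested posts
// and the base-36 IDs we actually received, return the base-36 IDs to retry.
function missing_ids(int $start_decimal, int $count, array $received): array {
    $received = array_flip($received); // flip for O(1) isset() lookups
    $missing = array();
    for ($i = 0; $i < $count; ++$i) {
        $id36 = base_convert($start_decimal - $i, 10, 36);
        if (!isset($received[$id36])) {
            $missing[] = $id36;
        }
    }
    return $missing;
}
```

A few seconds later, prepend t3_ to each of these and hit the info endpoint again; whatever is still missing after a retry or two is most likely a private-community post you'll never see.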

By the way, the code above also works perfectly well for comments. The only difference is that comment IDs start with t1_ instead of t3_.



After I got it figured out, scraping all of Reddit in real-time wasn't that tough. I launched F5Bot and it worked fine for a long time. Eventually, however, I ran into a second problem. Processing the data became the bottleneck. Remember, I'm doing this on a tiny VPS. It has a fast connection, but an anemic CPU.

F5Bot has to search every post and comment for all of the keywords that all of my users have. So it started out as something like this:

foreach ($new_posts as $post) {
    foreach ($all_keywords as $keyword) {
        if (strpos($post, $keyword) !== FALSE) {
            /* ... found a relevant post ... */
        }
    }
}


Eventually I converted it to use the Aho-Corasick string searching algorithm. It's really slick. You put your keywords into a tree structure as a pre-processing step. Then you only need to look at each post one time to see which keywords it contains.
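Here's a minimal sketch of the idea; this is my own illustrative implementation, not F5Bot's actual code. Build a trie from the keywords, fill in failure links with a breadth-first pass, then stream each post through the automaton once.

```php
<?php
// Minimal Aho-Corasick sketch: one pass over the text finds every keyword.
class AhoCorasick {
    private $next = array(array()); // trie transitions: node => [char => node]
    private $fail = array(0);       // failure links (longest proper suffix node)
    private $out  = array(array()); // keywords that end at each node

    public function __construct(array $keywords) {
        // Build the trie.
        foreach ($keywords as $word) {
            $node = 0;
            foreach (str_split($word) as $ch) {
                if (!isset($this->next[$node][$ch])) {
                    $this->next[] = array();
                    $this->fail[] = 0;
                    $this->out[]  = array();
                    $this->next[$node][$ch] = count($this->next) - 1;
                }
                $node = $this->next[$node][$ch];
            }
            $this->out[$node][] = $word;
        }
        // BFS to fill in failure links; depth-1 nodes fall back to the root.
        $queue = array_values($this->next[0]);
        while ($queue) {
            $node = array_shift($queue);
            foreach ($this->next[$node] as $ch => $child) {
                $f = $this->fail[$node];
                while ($f && !isset($this->next[$f][$ch])) {
                    $f = $this->fail[$f];
                }
                $this->fail[$child] = isset($this->next[$f][$ch]) ? $this->next[$f][$ch] : 0;
                // Inherit matches that end at the failure node.
                $this->out[$child] = array_merge($this->out[$child], $this->out[$this->fail[$child]]);
                $queue[] = $child;
            }
        }
    }

    public function search(string $text): array {
        $found = array();
        $node = 0;
        foreach (str_split($text) as $ch) {
            while ($node && !isset($this->next[$node][$ch])) {
                $node = $this->fail[$node]; // fall back on mismatch
            }
            $node = isset($this->next[$node][$ch]) ? $this->next[$node][$ch] : 0;
            foreach ($this->out[$node] as $word) {
                $found[$word] = true;
            }
        }
        return array_keys($found);
    }
}
```

With $ac = new AhoCorasick($all_keywords) built once, the inner keyword loop from before collapses to a single $ac->search($post) call per post, so the cost per post no longer grows with the number of keywords.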




I hope you enjoyed the write-up! Special thanks to Intoli and Evan Sangaline for the idea to write this article and for hosting it here.


If you want to read more of my posts you can check out my blog at codeplea.com, or follow me on Github.

