How F5Bot Slurps All of Reddit

By Lewis Van Winkle | July 30, 2018

In this guest post, Lewis Van Winkle talks about F5Bot, a free service that emails you when selected keywords are mentioned on Reddit, Hacker News, or Lobsters. He explains in detail how F5Bot is able to process millions of comments and posts from Reddit every day on a single VPS. You can check out more of Lewis Van Winkle's writing at codeplea.com and his open source contributions at github.com/codeplea.

I run a free service, F5Bot. The basic premise is that you enter a few keywords to monitor, and it'll email you whenever those keywords show up on Reddit or Hacker News. I'm hoping to add a few more sites soon.

This post is going to be about how I built the service. Specifically, we’re going to focus on scraping every post and comment from Reddit in real time.

F5Bot is written in PHP. It's programmed in a very boring, straightforward manner. I'm not doing anything special, really, but it does manage to scan every Reddit post and comment in real time. And it does it all on a tiny low-end VPS. Sometimes people are surprised when I tell them this. "Reddit has huge volume," they say, "Reddit runs a room full of supercomputers, and they still have downtime like every day! How do you do it?" Well, I'm going to tell you.

Reddit gets about 500,000 new posts and 4,000,000 new comments per day. Combined, that's about 50 per second. However, they don't come in at a constant rate, so we'd better be able to handle a few hundred per second. You may think PHP is slow, but it can definitely handle that. I mean, how long does it take your computer to search through a few hundred files? Really, the only catch is that we need to write the program carefully. We can't waste resources, but if we're careful, we'll be able to keep up with the traffic.
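As a quick sanity check on those numbers, here's the back-of-the-envelope arithmetic as a small PHP sketch. The daily totals are the rough figures quoted above; the peak multiplier is just an assumption for illustration.

// Rough daily volumes quoted above.
$posts_per_day    = 500000;
$comments_per_day = 4000000;

$seconds_per_day = 24 * 60 * 60; // 86,400

// Average combined rate: (500,000 + 4,000,000) / 86,400 is roughly 52 items per second.
$average_rate = ($posts_per_day + $comments_per_day) / $seconds_per_day;
printf("Average: %.1f items/second\n", $average_rate);

// Traffic is bursty, so budget for several times the average -- a few hundred per second.
printf("Peak budget (5x average, an assumption): %.0f items/second\n", $average_rate * 5);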

So let’s get started!

Reddit’s Listing API

Reddit provides a nice JSON API. There's some good documentation here. I'm pretty sure their site uses it internally to render the HTML pages for visitors. So we'll just use that to grab the most recent posts. It's at https://www.reddit.com/r/all/new/.json, and I recommend you open that up in your browser to follow along.

In PHP we'll just grab it with a simple cURL call and then decode it with json_decode(). (As a side note, I'm going to completely skip over error handling because it's so boring. I mean, it's the most important part of running an actual real-live server, but it's boring, so I won't talk about it anymore here.)

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.reddit.com/r/all/new/.json");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'testbot2000'); /* Don't forget to name yourself! */
$posts = curl_exec($ch);
curl_close($ch);
$posts = json_decode($posts, true);
print_r($posts);

Here's a sample showing the basic structure of the JSON you'll get back. I've trimmed it down by leaving in only 2 of the 25 posts and by removing most of the key/value pairs (there are a lot of them).

{
  "kind": "Listing",
  "data": {
    "modhash": "5b1dwrfbn4a9f8c200824114ecc7110ee2baa4d95dc2c106b9",
    "dist": 25,
    "children": [
      {
        "kind": "t3",
        "data": {
          "approved_at_utc": null,
          "subreddit": "Guitar",
          "selftext": "I practice guitar every day. Most days I...",
          "user_reports": [],
          "saved": false,
          "mod_reason_title": null,
          "gilded": 0,
          "clicked": false,
          "title": "[DISCUSSION] Thanks to the people on here",
          "subreddit_name_prefixed": "r/Guitar",
          "hidden": false,
          "id": "91sed5",
          /* Many more key/value pairs. */
        }
      },
      {
        "kind": "t3",
        "data": {
          "approved_at_utc": null,
          "subreddit": "gpdwin",
          "selftext": "What is the cheapest way to buy a unit...",
          "user_reports": [],
          "saved": false,
          "mod_reason_title": null,
          "gilded": 0,
          "clicked": false,
          "title": "Looking for advice on buying a WIN 2 in the UK",
          "subreddit_name_prefixed": "r/gpdwin",
          "hidden": false,
          "id": "91secm",
          /* Many more key/value pairs. */
        }
      },
      /* Many more objects. */
    ],
    "after": "t3_91secm",
    "before": null
  }
}

Depending on how you have PHP set up, cURL may have trouble with the secure HTTPS connection. If that's the case, you can bypass it with curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false).

Now we have all the recent posts in $posts as a nice clean PHP array. This is going to be a short article; we're almost done already.

But wait, how many posts did we get? I only want my scraper to run every few minutes, so hopefully it's a lot.

print("Downloaded " . count($posts['data']['children']) . " posts.\n");
Downloaded 25 posts.

Well, that's kind of lame. I only want to run my scraper every few minutes, not continuously, so we'd better get at least a minute's worth of posts at a time. That's about 300 posts, but let's grab 1,000 just to be safe. We can do that by adding ?limit=1000 to the end of the URL.

...
curl_setopt($ch, CURLOPT_URL, "https://www.reddit.com/r/all/new/.json?limit=1000");
...
print("Downloaded " . count($posts['data']['children']) . " posts.\n");
Downloaded 100 posts.

It turns out that Reddit has a limit. It will only show you 100 posts at a time.

There's an easy solution, though. If you look closely at the JSON, you'll see that it has an after field with an ID in it. The after field gives us the ID of the oldest post in this listing. So we're saved! We can just call the API again, but ask for the 100 posts that come after that post ID.

$after = $posts['data']['after'];
...
curl_setopt($ch, CURLOPT_URL, "https://www.reddit.com/r/all/new/.json?limit=100&after=$after");
...

This works well up to a point. That point is about 1,000 posts. After that it will either loop back to the beginning and start showing you posts you’ve already seen, or it will just stop returning anything. I’m not a big fan of that, because if my scraper has a little down time I’d like it to go back and grab the old posts too, but maybe you’re not worried about that.

Now, everything I've said here about posts is also true for comments. The only difference is that the URL is https://www.reddit.com/r/all/comments/.json instead of https://www.reddit.com/r/all/new/.json, and the comments come in much faster.

But we’ve got an even bigger problem. Each request takes a few seconds (or much longer if Reddit servers are loaded), and that means we can’t pull comments as quickly as they’re coming in. Remember, comments get posted at a rate of 50 per second, and it can be much more during peak traffic.
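To get a feel for how much parallelism that implies, here's a rough sketch of the math. The request latency and batch size are illustrative assumptions, not measurements.

// Illustrative assumptions: each API request returns up to 100 items and
// takes around 7 seconds when Reddit's servers are loaded.
$items_per_request   = 100;
$seconds_per_request = 7;
$items_per_second    = 50;   // combined post + comment rate from earlier

// One connection manages 100 / 7, roughly 14 items per second, so keeping up
// with 50 items per second takes at least 4 simultaneous connections -- and
// more than that to catch up after downtime or during peak traffic.
$per_connection = $items_per_request / $seconds_per_request;
$connections_needed = (int)ceil($items_per_second / $per_connection);

print("Need at least $connections_needed simultaneous connections.\n");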

Making Multiple Simultaneous Connections

So we'll have to make a bunch of connections at once. PHP makes this easy with the curl_multi functions, but we're getting ahead of ourselves. If we're just pulling the listings, how do we know which ID to start the next request on if the current request hasn't finished yet?

So each Reddit thing, like a post or comment, has a unique ID. Post IDs start with t3_ and comment IDs start with t1_. The API is very inconsistent about whether it uses the prefix or not. Some parts of the API require the prefix, but some just use the ID with no prefix.

The IDs themselves are encoded in base 36, so they use the ten digits 0-9 plus the twenty-six letters a-z. In PHP we can use base_convert() to convert them to decimal and do math with them.

So here we grab the newest post's ID and convert it to decimal.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.reddit.com/r/all/new/.json?limit=1");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$posts = curl_exec($ch);
curl_close($ch);
$posts = json_decode($posts, true);
$id = $posts['data']['children'][0]['data']['id'];
$id = base_convert($id, 36, 10);

From there it's easy to do math on $id and then convert it back to Reddit's format with base_convert($id, 10, 36). Very handy.
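For example, you might wrap that up in a couple of small helpers that also smooth over the prefix inconsistency. This is just an illustration, not code from F5Bot.

// Convert a Reddit ID like "t3_91secm" (or just "91secm") to a decimal integer.
function reddit_id_to_int($id) {
    $id = preg_replace('/^t[0-9]_/', '', $id); // strip "t1_", "t3_", etc. if present
    return (int)base_convert($id, 36, 10);
}

// Convert a decimal integer back to a base 36 Reddit ID, optionally with a prefix.
function int_to_reddit_id($n, $prefix = '') {
    return $prefix . base_convert($n, 10, 36);
}

// The post right before "t3_91secm":
print(int_to_reddit_id(reddit_id_to_int("t3_91secm") - 1, "t3_") . "\n"); // t3_91secl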

The best part is all the bandwidth you're going to save by using base 36 instead of base 10! You're going to need it too, since almost the entire Reddit API consists of crap you'll almost certainly never need. I mean, do you really want both subreddit and subreddit_name_prefixed? They're the same, one just has an "r/" in front of it. In fact, take a look at that JSON again: https://www.reddit.com/r/all/new/.json. Basically I use the selftext, subreddit, permalink, url, and title fields. The other 95% of it is just wasted bandwidth. Actually, over half of the reply is JSON key text, so even if you were going to use every value, you'd still be spending more bandwidth on key names than on actual data. Oh well, I guess it's still better than XML.

Anyway, now we can manipulate Reddit IDs, and it turns out that both comments and posts are assigned IDs more or less serially. They’re not strictly monotonic in the short term, but they are in the long term. (i.e. a post from an hour ago will have a lower ID than a post right now, but a post from 2 seconds ago may not)

So we can find where to start each successive API call by simply subtracting 100 from the starting ID. Using this method we can download a whole bunch of posts at once. Here's the latest 1,000 posts, downloaded simultaneously. It assumes you've already loaded the newest post's ID into $id (in decimal).

$mh = curl_multi_init();
$curls = array();

for ($i = 0; $i < 10; ++$i) {
    $ch = curl_init();
    $url = "https://www.reddit.com/r/all/new/.json?limit=100&after=t3_" . base_convert($id, 10, 36);
    print("$url\n");
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_multi_add_handle($mh, $ch);
    $curls[] = $ch;
    $id -= 100;
}

$running = null;
do {
    curl_multi_exec($mh, $running);
} while ($running);

echo "Finished downloading\n";

foreach ($curls as $ch) {
    $response_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $response_time = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
    $posts = curl_multi_getcontent($ch);
    $posts = json_decode($posts, true);
    print("Response: $response_code, Time: $response_time, Post count: " .
        count($posts['data']['children']) . ", Starting id: " .
        $posts['data']['children'][0]['data']['id'] . "\n");
    curl_multi_remove_handle($mh, $ch);
}

curl_multi_close($mh);

If you try to push that back past about 1,000 posts you’ll just get empty replies. Assuming that you’re ok with the 1,000 post limit, we’re done. We’ve got the API totally figured out, scraping is a solved problem, and you’re ready to finish building your bot.

Except… if you look at the post IDs you’re getting, you’ll find some funny business. Some posts are repeated. That’s not a big deal, we can ignore them, but also some posts are missing. That’s a big deal! We aren’t slurping all of Reddit if we’re missing posts.

The missing posts are caused by the IDs not being handed out totally in order. Or maybe it's just some other bug, I don't know. In any case, there's nothing we can do about it here that I know of. We'll need another solution. The 1,000 post limit was really annoying anyway. And the 1,000 comment limit meant we'd need to scrape every 10 seconds just to avoid missing anything. That wasn't really viable long-term.

A Whole New Approach

Reddit's listing APIs are terrible for trying to see every post and comment. To be fair, they probably weren't designed for that. I guess they're just designed to work with their website to render pages. No real human would want to keep clicking past 1,000 posts.

So here's the method I ended up using instead, and it works much better: request every post by its ID. That's right, instead of asking for posts in batches of 100 from a listing, we ask for each post individually by its post ID. We'll do the same thing for comments.

This has some huge advantages. First, and most importantly, we won't miss any posts. Second, we won't get any post twice. Third, we can go back in time as far as we like. This means we can run our scraper in batches every few minutes, and if there's downtime we can go back and get the old posts we missed. It's perfect.

Here's how we do it. We find the starting post ID, and then we request posts individually from https://api.reddit.com/api/info.json?id=. We just append a big list of post IDs to the end of that API URL.

So here are 2,000 posts, spread out over 20 batches of 100 that we download simultaneously. It assumes you've already got the last post ID loaded into $id in base 36.

print("Starting id: $id\n");

$urls = array();

for ($i = 0; $i < 20; ++$i) {
    $ids = array();
    for ($j = 0; $j < 100; ++$j) {
        $ids[] = "t3_" . $id;
        $id = base_convert((base_convert($id, 36, 10) - 1), 10, 36);
    }
    $urls[] = "https://api.reddit.com/api/info.json?id=" . implode(',', $ids);
}

$mh = curl_multi_init();
$curls = array();

foreach ($urls as $url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, "This is actually needed!");
    print("$url\n");
    curl_multi_add_handle($mh, $ch);
    $curls[] = $ch;
}

$running = null;
do {
    curl_multi_exec($mh, $running);
} while ($running);

print("Finished downloading\n");

foreach ($curls as $ch) {
    $response_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $response_time = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
    $posts = json_decode(curl_multi_getcontent($ch), true);
    print("Response: $response_code, Time: $response_time, Post count: " .
        count($posts['data']['children']) . ", Starting id: " .
        $posts['data']['children'][0]['data']['id'] . "\n");
    curl_multi_remove_handle($mh, $ch);
}

curl_multi_close($mh);

We end up with some pretty long URLs, like this, but Reddit doesn't seem to mind.

Also note that we actually need to set a user agent for the info.json API endpoint. Apparently Reddit blocks the default PHP/cURL user agent from it. Setting a user agent on everything is a good idea anyway; I've only omitted it to keep the code samples short. See Reddit's API access rules for more info.

From the script above, you'll get output like this:

Finished downloading
Response: 200, Time: 5.735, Post count: 96, Starting id: 90yv0w
Response: 200, Time: 7.36, Post count: 99, Starting id: 90yuy4
Response: 200, Time: 6.532, Post count: 100, Starting id: 90yuvc
Response: 200, Time: 6.547, Post count: 99, Starting id: 90yusk
Response: 200, Time: 7.344, Post count: 99, Starting id: 90yups
Response: 200, Time: 7.25, Post count: 94, Starting id: 90yun0
Response: 200, Time: 6.672, Post count: 83, Starting id: 90yuk8
Response: 200, Time: 7.469, Post count: 97, Starting id: 90yuhg
Response: 200, Time: 6.344, Post count: 91, Starting id: 90yueo
Response: 200, Time: 7.187, Post count: 97, Starting id: 90yubw
Response: 200, Time: 22.734, Post count: 96, Starting id: 90yu94
Response: 200, Time: 7.453, Post count: 93, Starting id: 90yu6c
Response: 200, Time: 7.359, Post count: 99, Starting id: 90yu3k
Response: 200, Time: 7.812, Post count: 100, Starting id: 90yu0s
Response: 200, Time: 7.703, Post count: 100, Starting id: 90yty0
Response: 200, Time: 6.375, Post count: 99, Starting id: 90ytv8
Response: 200, Time: 6.734, Post count: 97, Starting id: 90ytsg
Response: 200, Time: 7.328, Post count: 97, Starting id: 90ytpo
Response: 200, Time: 7.812, Post count: 98, Starting id: 90ytmw
Response: 200, Time: 7.812, Post count: 98, Starting id: 90ytk4

So even though we asked for posts in batches of 100, many batches come back short. There are two reasons for this. First, some posts are going to be in private communities you don't have access to. You won't be able to see them; there's nothing you can do about that. Second, because post IDs aren't assigned in perfect order, it could be that some of the missing posts simply haven't been written yet. In that case, we just wait a few seconds and do another request with the missing IDs. Not a big deal.
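Here's a sketch of one way to handle that, with hypothetical variable names like $requested_ids: diff the IDs you asked for against the IDs that came back, and queue anything missing for a later info.json request.

// $requested_ids holds the "t3_..." IDs we put in this batch's info.json URL,
// and $posts is the decoded response for the batch.
$received_ids = array();
foreach ($posts['data']['children'] as $child) {
    $received_ids[] = "t3_" . $child['data']['id'];
}

// Anything we asked for but didn't get back is either private or not written
// yet -- hold onto it and retry it in a later request.
$missing_ids = array_diff($requested_ids, $received_ids);
if (count($missing_ids) > 0) {
    $retry_url = "https://api.reddit.com/api/info.json?id=" . implode(',', $missing_ids);
    // ...wait a few seconds, then fetch $retry_url like any other batch...
}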

By the way, the code above works exactly the same for comments. The only difference is that comment IDs start with t1_ instead of t3_.

Another issue you may run into when loading comments is that the API doesn't return the title of the post that a comment is attached to. If you need that (F5Bot does), you have to use the parent_id field to get the ID of the parent post. Then you need a separate call to load that post so you can grab its title. Luckily, you can batch a bunch of these lookups together into a single request.
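A rough sketch of that batching, with hypothetical variable names, handling only top-level comments (whose parent is the post itself):

// $comments is a decoded batch of comment objects ("t1_...").
$post_fullnames = array();
foreach ($comments['data']['children'] as $child) {
    // parent_id is the fullname of whatever the comment replies to. For a
    // top-level comment that's the post itself ("t3_..."); replies to other
    // comments ("t1_...") need an extra hop to reach the post.
    $parent = $child['data']['parent_id'];
    if (strpos($parent, 't3_') === 0) {
        $post_fullnames[$parent] = true;
    }
}

// One info.json call can resolve many posts at once, so all of the titles
// for this batch of comments cost just one extra request.
$title_url = "https://api.reddit.com/api/info.json?id=" . implode(',', array_keys($post_fullnames));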

Working With the Data

After I got it all figured out, scraping all of Reddit in real time wasn't that tough. I launched F5Bot and it worked fine for a long time. Eventually, however, I ran into a second problem. Processing the data became the bottleneck. Remember, I'm doing this on a tiny VPS. It has a fast connection, but an anemic CPU.

F5Bot has to search every post and comment for all of the keywords that all of my users have. So it started out as something like this:

foreach ($new_posts as $post) {
    foreach ($all_keywords as $keyword) {
        if (strpos($post, $keyword) !== FALSE) {
            ... found a relevant post ...
        }
    }
}

As you can imagine, as I got more users I got more keywords, and eventually I was searching every single post for thousands of keywords. It got a bit slow.

Eventually I switched to the Aho-Corasick string searching algorithm. It's really slick. You put your keywords into a tree structure as a pre-processing step. Then you only need to look at each post one time to see which keywords it contains.

I couldn't find an Aho-Corasick implementation for PHP, so I wrote my own. It's on GitHub here.
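To make the idea concrete, here's a minimal Aho-Corasick sketch in PHP. It's just an illustration of the technique, not the implementation linked above: a trie of keywords, failure links built with a breadth-first pass, and then a single scan over each post.

// Build the keyword trie plus failure links once, then scan each post a
// single time. This is only a sketch of the algorithm, not F5Bot's code.
class AhoCorasick {
    private $next = array(array()); // $next[$state][$char] => next state
    private $fail = array(0);       // failure link for each state
    private $out  = array(array()); // keywords that end at each state

    public function addKeyword($keyword) {
        $state = 0;
        foreach (str_split($keyword) as $ch) {
            if (!isset($this->next[$state][$ch])) {
                $this->next[] = array();
                $this->fail[] = 0;
                $this->out[]  = array();
                $this->next[$state][$ch] = count($this->next) - 1;
            }
            $state = $this->next[$state][$ch];
        }
        $this->out[$state][] = $keyword;
    }

    // Breadth-first pass to set the failure links after all keywords are added.
    public function build() {
        $queue = array();
        foreach ($this->next[0] as $s) {
            $this->fail[$s] = 0;
            $queue[] = $s;
        }
        while ($queue) {
            $r = array_shift($queue);
            foreach ($this->next[$r] as $ch => $s) {
                $queue[] = $s;
                $f = $this->fail[$r];
                while ($f !== 0 && !isset($this->next[$f][$ch])) {
                    $f = $this->fail[$f];
                }
                $this->fail[$s] = isset($this->next[$f][$ch]) ? $this->next[$f][$ch] : 0;
                $this->out[$s] = array_merge($this->out[$s], $this->out[$this->fail[$s]]);
            }
        }
    }

    // Return every keyword that appears in $text, looking at each character once.
    public function search($text) {
        $found = array();
        $state = 0;
        foreach (str_split($text) as $ch) {
            while ($state !== 0 && !isset($this->next[$state][$ch])) {
                $state = $this->fail[$state];
            }
            $state = isset($this->next[$state][$ch]) ? $this->next[$state][$ch] : 0;
            foreach ($this->out[$state] as $keyword) {
                $found[$keyword] = true;
            }
        }
        return array_keys($found);
    }
}

$ac = new AhoCorasick();
foreach (array('guitar', 'gpd win', 'f5bot') as $keyword) {
    $ac->addKeyword(strtolower($keyword));
}
$ac->build();
print_r($ac->search(strtolower("Looking for advice on buying a GPD WIN 2 in the UK")));

The important property is that the scan looks at each character of a post only once, no matter how many thousands of keywords are loaded.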

Conclusion

Scraping all of Reddit in real time doesn't take a lot of processing power. You just need to be careful and know how to work around their API. If you implement any of these ideas, please be sure to follow Reddit's API access rules. It's cool of them to provide a public API at all.

I hope you enjoyed the write-up! Special thanks to Intoli and Evan Sangaline for the idea to write this article and for hosting it here.

If you'd like to know when someone on Reddit or Hacker News is talking about you, your company, or your product, give F5Bot a try. It's free.

If you want to read more of my posts, you can check out my blog at https://codeplea.com or follow me on GitHub.
