Performing Efficient Broad Crawls with the AOPIC Algorithm

By Andre Perunicic | September 16, 2018

This article explains how the Adaptive On-Line Page Importance Computation (AOPIC) algorithm works. AOPIC is useful for performing efficient broad crawls of large slices of the internet. The key idea behind the algorithm is that pages are crawled based on a continuously improving estimate of page importance. This effectively allows the user of the algorithm to allocate the bulk of their limited bandwidth to the most important pages that their crawler encounters.

Much like the famous PageRank algorithm, AOPIC iteratively determines a page's importance based on the extent to which other pages link to it. What's particularly nice about AOPIC is that it's simple to understand and implement, uses very little memory and few computational resources, and lets the user control the extent to which the discovery of new pages should be favored over making more accurate importance estimates.

Getting a handle on page importance, and prioritizing the exploration of more important pages, is crucial in real-world use cases. For example, in order to make the most of their bandwidth, internet archival services like the Internet Archive's Wayback Machine or Common Crawl have to decide which sites are important to update frequently, and which can be visited only occasionally. As another example, an advertising agency might wish to get a sense of which topics different publishers tend to write about, and what the general quality of a publisher's content is in order to more effectively allocate their ad inventory. Since pages are constantly added and removed from the web and can change over time, these users have to continuously update and refine their importance estimates with something like AOPIC.

This article starts by introducing a simplified version of AOPIC that works on a finite, unchanging graph–a simple mathematical abstraction of a network like the internet. This will allow us to actually run a version of the algorithm in the browser. Next, we explain how to overcome the limitations of this approach by generalizing the simple algorithm to work well on a dynamic, virtually infinite graph like the real internet.

The code used to run the AOPIC widgets featured later in this article is available in our intoli-article-materials repository. Feel free to use it as a starting point for your own projects, and consider starring the repo! It's a great way to find out about upcoming articles before they're released, and to send us a signal about the kind of content you would like to see more of.

OPIC: A Basic Version of the Algorithm

The internet can be mathematically abstracted as a directed graph. Each page becomes a node in the graph, and each link becomes an arrow between the nodes.

The internet as a graph

In this article, instead of labeling nodes with the URLs they correspond to, we'll use integer labels or simply leave them unlabeled. Under this representation, the actual internet would be a graph with infinitely many nodes due to the presence of dynamically generated pages. For example, a lot of sites employ infinite link cycles or "bot traps," while others use JavaScript to render some meaningful content on every path under the domain. Even ignoring these pages, we'd end up with a graph so large as to be virtually infinite for all practical purposes. Moreover, the internet's graph changes from moment to moment, so any graph representation would immediately become outdated.

To get the basic idea of the algorithm down, we'll start with a version that works on a finite, unchanging graph with $n$ nodes, each of which corresponds to a page. Later, we'll cover an extension of the algorithm to more realistic, arbitrarily large graphs. To simplify things, assume that each page has at least one link to another node. This assumption hints at a technical requirement for the algorithm to work, and we'll cover it in some detail later. Since this version of the algorithm doesn't account for changing graphs, the authors of the AOPIC paper dropped the first word, "Adaptive," from the name, calling it On-Line Page Importance Computation (OPIC).

The OPIC algorithm uses a credit system to calculate the importance of every node. A finite number of credits, 100 say, are allocated to the nodes of the graph, and then swapped around as nodes are "crawled." Each node $i$ has two numbers associated with it: the total number of credits currently allocated to the node, called the "cash" $C(i)$ of the node, and the sum total of credits that were previously allocated to the node, called the credit "history" $H(i)$ of the node. With these definitions out of the way, we can describe how the OPIC crawling algorithm works (a small code sketch follows the list of steps):

  1. Initialize each node by allocating an equal share of the total credits to it, and setting its history to 0. If there are $n$ nodes in the graph, then we are just initializing the graph with $C(i) = 100/n$ and $H(i) = 0$ for all $i$.

  2. Select a node $i$ using some crawling policy, such as random selection.

  3. Add an equal portion of $C(i)$ to every node which node $i$ links to. If node $i$ has $m$ edges to other nodes, then we would add $C(i)/m$ to each of them.

  4. Increment the node's history $H(i)$ by $C(i)$, set $C(i) = 0$, and repeat from step 2.
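
To make these steps concrete, here is a minimal sketch of the loop in Python. The dictionary-based graph, the node names, and the `crawl_once` helper are illustrative choices of our own rather than anything prescribed by the paper.

```python
import random

# A toy graph: each node maps to the list of nodes that it links to.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "b"],
}

TOTAL_CREDITS = 100.0

# Step 1: split the credits evenly across the nodes and zero out every history.
cash = {node: TOTAL_CREDITS / len(graph) for node in graph}
history = {node: 0.0 for node in graph}

def crawl_once():
    # Step 2: select a node using the crawling policy (random selection here).
    node = random.choice(list(graph))
    # Step 3: add an equal portion of the node's cash to each node it links to.
    amount = cash[node]
    links = graph[node]
    for target in links:
        cash[target] += amount / len(links)
    # Step 4: move the distributed cash into the node's history and reset it.
    history[node] += amount
    cash[node] = 0.0
```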

The idea is that over time you allocate more credits to the nodes that are linked to the most. If the edges come from other nodes which themselves have a lot of inbound edges, then these will naturally receive more credits than nodes which are linked to from nodes with few inbound edges. Each node’s importance is at any given time defined as the total portion of credits in the system that have been assigned to the node so far, and it’s clear that this importance estimate improves as we crawl more nodes. Mathematically speaking, after the $t$-th iteration of the algorithm, we define the importance $X_t(i)$ of node $i$ “at time $t$” as

$$ X_t(i) = \frac{H(i) + C(i)}{\sum_i H(i) + 100}\,. $$
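
In terms of the `cash` and `history` dictionaries from the sketch above, this estimate is a direct transcription of the formula:

```python
def importance(node):
    # X_t(i) = (H(i) + C(i)) / (sum of all histories + total credits)
    return (history[node] + cash[node]) / (sum(history.values()) + TOTAL_CREDITS)
```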

A node's cash and history obviously also change over time, but we'll only keep track of their latest values, and therefore avoid subscripting them by time.

It can be proven that as long as each node is crawled infinitely often, then $X_t(i)$ will converge to some value as the number of iterations $t$ of the algorithm goes to infinity:

$$ X(i) := \lim_{t \to \infty} X_t(i)\,. $$

It can also be proven that the convergence error at any time $t$ is inversely proportional to the total history of the system, $\left| H_t \right| = \sum_i H(i)$, which is just the sum of the current history of every node in the graph at time $t$. So, we can stop the algorithm when the value

$$ \frac{1}{\left| H_t \right|} $$

stops changing by very much.
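
Continuing the toy implementation from above, a stopping rule along these lines might look like the following; the checkpoint interval and tolerance are arbitrary values chosen for illustration rather than anything from the paper.

```python
previous_bound = None
for step in range(1, 100_001):
    crawl_once()
    # Check the error bound 1 / |H_t| every thousand iterations.
    if step % 1000 == 0:
        bound = 1.0 / sum(history.values())
        if previous_bound is not None and abs(previous_bound - bound) < 1e-9:
            break
        previous_bound = bound
```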

Demo of OPIC

To get a clearer sense of the cash allocation process, let's run the algorithm on a toy graph in the browser. The following widget lets you run the algorithm to see how the values update. You can manually go forward step-by-step, or press play and see the values update over time (control the speed by pressing - for slower iterations and + for faster iterations). By default, each node is represented by a circle containing two numbers. The top number is the node's current cash, and the bottom number is the node's credit history so far. You can switch these to display each node's importance by using the controls.

For now, we’ll just use random selection as the “crawling policy”, since the graph is small and will stabilize pretty quickly. We’ll get rid of that policy a little bit later.

After running the simulation for a bit, you should see that the importance starts stabilizing. The node in the middle should emerge as the one with most importance, which is hardly surprising given that it's by far the most linked-to one. A few other interesting tidbits can be observed, however. First, the node in the middle of the top row ends up with almost no importance, as expected, because it has no inbound edges! Second, although several nodes share the same number of inbound edges, their importance differs. In particular, the nodes which are pointed to by more important nodes end up being more important themselves. Obviously this is a bit of a contrived example, but it is already enough to serve as a sanity check for the algorithm.

AOPIC: Generalizing the Algorithm to Dynamic Graphs

The algorithm from the previous section included a few simplifying assumptions:

  1. each node is assumed to contain a link to some other node;
  2. the crawling policy is random selection; and
  3. the graph is assumed to be finite and unchanging.

Obviously these don’t hold up on the real internet, so a few modifications need to be made to the algorithm in order to make it useful in practice.

Handling the Graph's Connectivity

Pages online sometimes don’t have any outgoing links, which is a problem for the current algorithm because they would end up being the last stop for cash and cause it to effectively disappear from the rest of the graph. The simple way around this issue is to add a “virtual” page which every other page links to, and which links back to every other page (or a carefully chosen subset of pages at any given time–more on that in a moment). The virtual page will therefore accumulate credits during every iteration of the algorithm, and eventually be picked and distribute credits back into the system.
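
As a rough sketch of how such a virtual page could be wired into the dictionary-based graph from earlier (the `VIRTUAL` sentinel and the helper are hypothetical names of our own):

```python
VIRTUAL = "__virtual__"

def add_virtual_page(graph):
    # The virtual page links back to every real page in the graph...
    graph[VIRTUAL] = list(graph)
    # ...and every page, including pages without outgoing links, links to it.
    for node, links in graph.items():
        if node != VIRTUAL:
            links.append(VIRTUAL)
    return graph
```

With this in place, cash can never get stranded, because every page always has at least the edge to the virtual node.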

Strongly Connected Graphs

On a more technical note, this virtual page makes the graph "strongly connected," i.e., it makes it possible to follow a directed path between any two nodes in the graph. A similar trick was applied in the case of Google's original PageRank algorithm. In that case, the graph is encoded as a matrix whose $(i, j)$-th entry is positive if there is a link between page $i$ and page $j$. To induce strong connectivity, they perturb the matrix by adding a small number to each entry.
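
As a loose illustration of that perturbation trick (a sketch with made-up numbers, not Google's exact formulation), mixing a small uniform value into a row-normalized link matrix makes every entry positive, and therefore makes a directed path available between any two pages:

```python
import numpy as np

# Row-normalized link matrix for a tiny three-page graph: entry (i, j) is
# positive when page i links to page j.
links = np.array([
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])

# Blend in a small uniform matrix so that every entry becomes positive.
epsilon = 0.05
n = links.shape[0]
perturbed = (1 - epsilon) * links + epsilon * np.ones((n, n)) / n
```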

A Better Crawling Policy

The second issue with the simplified algorithm above is that pages are crawled at random. While this technically gets the algorithm to converge eventually, doing so on a large graph would be slow. If we really care about spending the bulk of our resources on the most important parts of the web, we can instead adopt a greedy crawling policy under which we choose the page with the largest amount of cash at any given time.
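
In the sketch from earlier, swapping out random selection for the greedy policy only changes how the node is picked in step 2; assuming the same `cash` dictionary, it comes down to something like:

```python
def greedy_policy(cash):
    # Choose whichever node currently holds the largest amount of cash.
    return max(cash, key=cash.get)

# Used in place of random.choice(list(graph)) inside crawl_once().
```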

In fact, convergence will be achieved two times as fast if we use the greedy selection policy instead of random node selection, at least as long as we assume that cash accumulates linearly. This is a claim made by the authors of the AOPIC algorithm and confirmed by their experiments, but it's also fairly intuitive. It's easy to see, and makes for an interesting aside, so let's take a small detour.

Every time a node is visited, the denominator $\left| H_t \right|$ of the convergence error bound increases by the amount of cash allocated to the page. In a graph with $n$ nodes crawled under the random policy, each node will on average get $100/n$ of the total $100$ credits in cash on each visit, and increase the denominator by a corresponding amount. Under the greedy crawling policy, we don't know the average cash of a node when it's visited, but let's call that number $\alpha$. Since after each visit the cash resets to 0, on average each node will hold $\alpha/2$ cash at any given time. There are $n$ nodes, so in fact $100 = n(\alpha/2)$ and therefore $\alpha = 200/n$. The denominator of the convergence error bound therefore grows twice as fast as under the random policy. Neat!

The greedy crawling policy also lets us make a tradeoff between new page discovery and the importance learning rate. The idea is that we can modify the virtual node's outgoing edges to bias the distribution of cash toward specific parts of the graph. If we point the virtual node only at previously uncrawled nodes, then those nodes will receive a larger share of the cash over time and get scraped more frequently. This effectively encourages exploration and the discovery of even newer links. Conversely, if the virtual page points to all other nodes, then there is no bias toward uncrawled nodes, and the scraper will spend more of its time on the most important parts of the graph.
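
One hypothetical way to express this tradeoff on top of the earlier sketch is to recompute the virtual page's outgoing edges as nodes get crawled; the `crawled` set and the helper below are our own bookkeeping, not something specified by the paper.

```python
crawled = set()

def update_virtual_links(graph):
    real_nodes = [node for node in graph if node != VIRTUAL]
    uncrawled = [node for node in real_nodes if node not in crawled]
    # Point the virtual page only at undiscovered nodes to favor exploration,
    # or back at every node once there is nothing left to discover.
    graph[VIRTUAL] = uncrawled if uncrawled else real_nodes
```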

Below is a synthetic example that's slightly larger than before, this time using the greedy policy and a virtual page. The virtual page's node is not rendered, but you will see all of the nodes light up each time it's selected. If the graph isn't readable, just hit refresh until a favorable configuration comes up; the graph's connectivity is the same each time, but the physics simulation that lays out the graph plays out slightly differently on each run. The graph is a 24-node scale-free network (plus the extra hidden virtual node), which is what the graph of the internet is commonly assumed to be.

Importances should stabilize pretty quickly here, and you should see that the more connected nodes toward the middle of the graph become more important than the relatively isolated pages on the outside.

Changing Graphs

So far, we have only mentioned finite and unchanging graphs. On the graph of the real internet, nodes are constantly added and removed, links between them modified, and even the most comprehensive crawl eventually becomes outdated. Moreover, pages are crawled at unpredictable intervals. To deal with this, and make AOPIC out of OPIC, the OPIC algorithm can be adapted to work on a sliding time window.

The basic idea is to use only data from each page's recent visits to form an imperfect model of page importance. A simple way to do this emerges if we again assume that credits accumulate linearly, since then we can easily interpolate between the cash a node has accumulated since its previous visit and the history the node has earned so far. We effectively want to "forget" the history from before the window.

To be more specific, let's fix a time window of $T$ hours, and call the number of hours since a page's last visit its "scraping period" $S$. The latter can change between visits, so let's consider the two cases $T < S$ and $S < T$ separately.

First, suppose that $T < S$.

The T Smaller than S Case

By forming ratios, we have that

$$\frac{C(i)}{S} = \frac{H_t(i)}{T}\,, $$

where as before $H_t(i)$ is the history of node $i$ at time $t$, and $C(i)$ the cash accumulated since the last visit. Rearranging, we see that if $T < S$, then “resetting” the history means that we scale the cash by the fraction of credits that would have been assigned linearly to the page during the window:

$$ H_t(i) = C(i) \frac{T}{S}\,. $$

Next, suppose that $S < T$.

The S Smaller than T Case

The ratios we are interested in are

$$ \frac{H_t(i) - C(i)}{T - S} = \frac{H_{t-S}(i)}{T}\,. $$

Rearranging, we see that the previous history is scaled down by the portion of $T$ which falls outside the current scraping period, and then incremented by all the cash accumulated by the page during the current scraping period:

$$ H_t(i) = H_{t-S}(i) \left( \frac{T - S}{T} \right) + C(i)\,. $$

In both cases, $C(i)$ is reset to zero, and we of course have to keep track of $S$ for every page now, too.
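
Putting the two cases together, the windowed update for a single visit might be sketched as follows, with `T` the window size in hours, `S` the hours since the node's last visit, and the `cash` and `history` dictionaries reused from the earlier sketch.

```python
def update_windowed_history(node, T, S):
    if T < S:
        # The window falls entirely inside the current scraping period, so the
        # windowed history is just the cash scaled by T / S.
        history[node] = cash[node] * T / S
    else:
        # Keep the fraction of the old history that falls inside the window,
        # then add all of the cash accumulated during this scraping period.
        history[node] = history[node] * (T - S) / T + cash[node]
    # As before, the cash is reset to zero after every visit.
    cash[node] = 0.0
```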

As $T$ grows, these windowed history estimates approach the original definition, since then $S/T \simeq 0$, and our importance estimates will change less and less as time goes on. On the other hand, if the window size is very small, then we are for the most part just keeping track of cash, which can cause the estimates to fluctuate. AOPIC's authors found that a window size of 4 months is appropriate for open-ended broad crawls, but the paper was written in 2003 and the internet and web scrapers have sped up since then, so we can likely get good results on smaller parts of the web with substantially smaller windows. Implementing the algorithm is left as an exercise for the reader. ;)

Conclusion

If you'd like to dig deeper into this algorithm, there are quite a few more details in the original paper. These include benchmarks of the convergence error for variations of the algorithm and its parameters, a short proof of convergence, and further comparisons with PageRank. There are also some notes about implementation, but their usefulness will depend on your particular scraping setup.

However you end up parsing the pages from your broad crawls, you will inevitably have to deal with blocking from individual websites and from large networks like Distil. To successfully scrape a large number of pages, you need a proxy that is smart about how it routes and makes requests. Luckily, we have just the thing for you!

Intoli Smart Proxies

If you're doing serious web scraping, then using proxies is a must. Our Intoli Smart Proxy service makes it easy to stop getting blocked by bot mitigation services. Your requests are intelligently routed through clean residential IPs where they are most likely to succeed, failed requests are automatically retried, and you can even make requests through customized remote browsers that make your scrapers harder to detect. Enter your email address below to get access to the same tools that we use for all of our own web scraping!

If you enjoyed this article, then consider subscribing to our mailing list. Another great way to follow our articles is to star the intoli-article-materials repository, since it gets updated with supporting materials for each one of our articles before it's released.

Suggested Articles

If you enjoyed this article, then you might also enjoy these related ones.

User-Agents — Generating random user agents using Google Analytics and CircleCI

By Evan Sangaline
on August 30, 2018

A free dataset and JavaScript library for generating random user agents that are always current.


How F5Bot Slurps All of Reddit

By Lewis Van Winkle
on July 30, 2018

The creator of F5Bot explains in detail how it works, and how it's able to scrape millions of Reddit comments per day.


No API is the Best API — The Elegant Power of Power Assert

By Evan Sangaline
on July 24, 2018

A look at what makes power-assert our favorite JavaScript assertion library, and an interview with the project's author.

