# Performing Efficient Broad Crawls with the AOPIC Algorithm

This article starts by introducing a simplified version of AOPIC that works on a finite, unchanging graph, a simple mathematical abstraction of a network like the internet. This will allow us to actually run a version of the algorithm in the browser. Next, we explain how to overcome the limitations of this approach by generalizing the simple algorithm to work well on a dynamic, virtually infinite graph like the real internet.

The code for running the AOPIC widgets presented later in this article is available in our Intoli article materials repository. Feel free to use it as a starting point for your own projects, and consider starring the repo! It's a great way to find out about upcoming articles before they're released, and to send us a signal about the kind of content you would like to see more of.

## OPIC: A Basic Version of the Algorithm

The internet can be mathematically abstracted as a directed graph. Each page becomes a node in the graph, and each link becomes an arrow between the nodes. In this article, instead of labeling nodes with the urls they correspond to, we'll use integer labels or simply leave them unlabeled. Under this representation, the actual internet would be a graph with infinitely many nodes due to the presence of dynamically generated pages. For example, a lot of sites employ infinite link cycles or "bot traps," while others use JavaScript to render some meaningful content on every path under the domain. Even ignoring these pages, we'd end up with a graph so large as to be virtually infinite for all practical purposes. Moreover, the internet's graph changes from moment to moment, so any graph representation would immediately become outdated.

The OPIC algorithm uses a credit system to calculate the importance of every node. A finite number of credits, 100 say, are allocated to the nodes of the graph, and then swapped around as nodes are “crawled.” Each node $i$ has two numbers associated with it: the total number of credits currently allocated to the node, called the “cash” $C(i)$ of the node, and the sum total of credits that were previously associated to the page, called the credit “history” $H(i)$ of the node. With these definitions out of the way, we can describe how the OPIC crawling algorithm works:

1. Initialize each node by allocating an equal share of the total credits to it, and setting its history to 0. If there are $n$ nodes in the graph, then we are just initializing the graph with $C(i) = 100/n$ and $H(i) = 0$ for all $i$.

2. Select a node $i$ using some crawling policy, such as random selection.

3. Add an equal portion of $C(i)$ to every node which node $i$ links to. If node $i$ has $m$ edges to other nodes, then we would add $C(i)/m$ to each of them.

4. Increment the node's history $H(i)$ by $C(i)$, and set $C(i) = 0$.

The idea is that over time you allocate more credits to the nodes that are linked to the most. If the edges come from other nodes which themselves have a lot of inbound edges, then these will naturally receive more credits than nodes which are linked to from nodes with few inbound edges. Each node's importance at any given time is defined as the total portion of credits in the system that have been assigned to the node so far, and this importance estimate improves as we crawl more nodes. Mathematically speaking, after the $t$-th iteration of the algorithm, we define the importance $X_t(i)$ of node $i$ "at time $t$" as

$$X_t(i) = \frac{H(i) + C(i)}{\sum_i H(i) + 100}\,.$$

The cash and history of a node obviously also change over time, but we'll only keep track of their latest values, which lets us avoid subscripting them. It can be proven that, as long as each node is crawled infinitely often, $X_t(i)$ will converge to some value as the number of iterations $t$ of the algorithm goes to infinity:

$$X(i) := \lim_{t \to \infty} X_t(i)\,.$$

It can also be shown that the convergence error at any time $t$ is inversely proportional to the total history of the system, $\left| H_t \right| = \sum_i H(i)$, which is just the sum of the current history of every node in the graph at time $t$. So, we can stop the algorithm when the value

$$\frac{1}{\left| H_t \right|}$$

stops changing by very much.
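To make the bookkeeping concrete, here is a minimal Python sketch of the four steps above. The four-node graph and the fixed iteration count are made up for illustration; they are not the graph from the widget below.

```python
import random

# Toy directed graph: node -> list of nodes it links to.
# This structure is arbitrary; any graph where every node has at
# least one outgoing link will work.
graph = {
    0: [1, 2],
    1: [2],
    2: [0, 3],
    3: [0],
}

TOTAL_CREDITS = 100
n = len(graph)

# Step 1: split the credits evenly and zero out the histories.
cash = {i: TOTAL_CREDITS / n for i in graph}
history = {i: 0.0 for i in graph}

def importance(i):
    # X_t(i) = (H(i) + C(i)) / (sum_i H(i) + 100)
    return (history[i] + cash[i]) / (sum(history.values()) + TOTAL_CREDITS)

for _ in range(10_000):
    # Step 2: pick a node using the crawling policy (random here).
    i = random.choice(list(graph))
    # Step 3: distribute the node's cash equally among its out-links.
    for j in graph[i]:
        cash[j] += cash[i] / len(graph[i])
    # Step 4: move the distributed cash into the node's history.
    history[i] += cash[i]
    cash[i] = 0.0

print({i: round(importance(i), 3) for i in graph})
```

Because credits are only moved around, never created or destroyed, the importances always sum to exactly 1, which is a handy sanity check when implementing this yourself.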

### Demo of OPIC

To get a clearer sense of the cash allocation process, let's run the algorithm on a toy graph in the browser. The following widget lets you run the algorithm to see how the values update. You can manually go forward step-by-step, or press play and see the values update over time (control the speed by pressing - for slower iterations and + for faster iterations). By default, each node is represented by a circle containing two numbers. The top number is the node's current cash, and the bottom number is the node's prior credit history. You can instead display each node's importance by toggling the values with the controls.

For now, we’ll just use random selection as the “crawling policy”, since the graph is small and will stabilize pretty quickly. We’ll get rid of that policy a little bit later.

After running the simulation for a bit, you should see that the importance starts stabilizing. The node in the middle should emerge as the one with the most importance, which is hardly surprising given that it's by far the most linked-to one. A few other interesting tidbits can be observed, however. First, the node in the middle of the top row ends up with almost no importance, as expected, because it has no inbound edges! Second, although several nodes share the same number of inbound edges, their importance differs. In particular, the nodes which are pointed to by more important nodes end up being more important themselves. Obviously this is a bit of a contrived example, but it is already enough to serve as a sanity check for the algorithm.

## AOPIC: Generalizing the Algorithm to Dynamic Graphs

The basic version of OPIC described above relies on three simplifying assumptions:

1. every node is assumed to contain links to some other nodes;
2. the crawling policy is random selection; and
3. the graph is assumed to be finite and unchanging.

Obviously these don’t hold up on the real internet, so a few modifications need to be made to the algorithm in order to make it useful in practice.

### Dealing with the Graph's Connectivity

Pages online sometimes don't have any outgoing links, which is a problem for the current algorithm because they would end up being the last stop for cash and cause it to effectively disappear from the rest of the graph. The simple way around this issue is to add a "virtual" page which every other page links to, and which links back to every other page (or a carefully chosen subset of pages at any given time; more on that in a moment). The virtual page will therefore accumulate credits during every iteration of the algorithm, and will eventually be picked and distribute its credits back into the system.
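As a sketch, here is one way the virtual page trick might look in code. The `add_virtual_node` helper and the `"virtual"` label are hypothetical names chosen for illustration:

```python
def add_virtual_node(graph, virtual="virtual"):
    """Return a copy of the graph with a virtual node added.

    Every node links to the virtual node, and the virtual node links
    back to every other node, so cash can never get stuck in a sink.
    """
    augmented = {node: links + [virtual] for node, links in graph.items()}
    augmented[virtual] = list(graph)
    return augmented

# Node "b" has no outgoing links and would otherwise trap cash forever.
graph = {"a": ["b"], "b": []}
augmented = add_virtual_node(graph)
# augmented == {"a": ["b", "virtual"], "b": ["virtual"], "virtual": ["a", "b"]}
```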

#### Strongly Connected Graphs

On a more technical note, this virtual page makes the graph "strongly connected," meaning that it becomes possible to follow a directed path between any two nodes in the graph. A similar trick was applied in the case of Google's original PageRank algorithm. In that case, the graph was encoded as a matrix whose $(i, j)$-th entry is positive if there is a link between page $i$ and page $j$, and zero otherwise. To induce strong connectivity, they perturbed the matrix by adding a small number to every entry.
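For instance, the perturbation trick might look like this on a tiny made-up adjacency matrix; the 3-node graph and the value of `epsilon` are arbitrary:

```python
# Adjacency matrix of a 3-node graph; A[i][j] is positive when page i
# links to page j. Page 2 is a sink with no outgoing links.
A = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]

# Perturbing every entry by a small epsilon makes every page directly
# reachable from every other page, so the graph is strongly connected.
epsilon = 0.01
A_perturbed = [[entry + epsilon for entry in row] for row in A]
```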

### A Better Crawling Policy

A crawling policy that works better than random selection is a greedy one: always crawl the node that currently holds the most cash, since that's where a fresh visit moves the most credits through the system. Importances should stabilize pretty quickly here, and you should see that the more connected nodes toward the middle of the graph become more important than the relatively isolated pages on the outside.
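One policy that tends to work better than random selection is a greedy one that always crawls the node currently holding the most cash. A minimal sketch, where the `greedy_policy` name and the sample cash values are made up for illustration:

```python
def greedy_policy(cash):
    # Crawl the node that currently holds the most cash.
    return max(cash, key=cash.get)

cash = {"a": 12.5, "b": 40.0, "c": 7.5}
assert greedy_policy(cash) == "b"
```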

### A Changing Graph

So far, we have only considered finite and unchanging graphs. On the graph of the real internet, nodes are constantly added and removed, links between them modified, and even the most comprehensive crawl eventually becomes outdated. Moreover, pages are crawled at unpredictable intervals. To deal with this, and make AOPIC out of OPIC (the "A" stands for "Adaptive"), the OPIC algorithm can be adapted to work on a sliding time window. Suppose that the window has length $T$, and that node $i$ was last visited $S$ time units ago. If credits accumulate at a roughly constant rate, then the rate of accumulation since the last visit should match the rate over the window: $$\frac{C(i)}{S} = \frac{H_t(i)}{T}\,,$$

where as before $H_t(i)$ is the history of node $i$ at time $t$, and $C(i)$ the cash accumulated since the last visit. Rearranging, we see that if $T < S$, then “resetting” the history means that we scale the cash by the fraction of credits that would have been assigned linearly to the page during the window:

$$H_t(i) = C(i) \frac{T}{S}\,.$$
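In code, that rescaling is a one-liner. The `reset_history` helper below is a hypothetical name, and it assumes credits accumulated roughly linearly since the last visit:

```python
def reset_history(cash_since_visit, window_length, time_since_visit):
    """Estimate the windowed history H_t(i) = C(i) * T / S.

    Assumes credits accumulated roughly linearly over the
    time_since_visit (S) leading up to this crawl.
    """
    return cash_since_visit * window_length / time_since_visit

# If a page gathered 10 credits over the last 200 time units, its
# history over a 100-unit window is estimated as 10 * 100 / 200 = 5.
assert reset_history(10.0, 100.0, 200.0) == 5.0
```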

## Conclusion
