Let
| 0 0 0 1 0 |
| 0 0 0 1 0 |
| 0 0 0 1 1 |
| 1 1 1 0 0 |
| 0 0 1 0 0 |
be the adjacency matrix of a graph, where entry (i, j) is 1 iff page i links to page j. The transition matrix M
used by PageRank is then
| 0 0 0 1/3 0 |
| 0 0 0 1/3 0 |
| 0 0 0 1/3 1 |
| 1 1 1/2 0 0 |
| 0 0 1/2 0 0 |
which is column stochastic and irreducible. (It is not aperiodic, though: every edge runs between {1, 2, 3} and {4, 5}, so the chain has period 2. The teleportation/damping step of full PageRank is what restores aperiodicity; the bare power iteration below works with M directly.)
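The normalization from A to M can be sketched in a few lines of Python (a minimal sketch; the names A, outdeg, and M are just illustrative):

```python
# Build the column-stochastic transition matrix M from the adjacency
# matrix A above (A[i][j] = 1 iff page i links to page j).
# M[i][j] = A[j][i] / outdegree(j): column j spreads page j's rank
# evenly over its out-links.

A = [
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

n = len(A)
outdeg = [sum(row) for row in A]  # out-degrees: [1, 1, 2, 3, 1]
M = [[A[j][i] / outdeg[j] for j in range(n)] for i in range(n)]

# Every column sums to 1, i.e. M is column stochastic.
assert all(abs(sum(M[i][j] for i in range(n)) - 1.0) < 1e-9 for j in range(n))
```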
MapReduce starts from here. The serialized input to the mappers looks like
1 -> 4
2 -> 4
3 -> 4 , 5
4 -> 1 , 2 , 3
5 -> 3
and the mappers will emit the following:
< 1 , [4] >
< 4 , 1 >
< 2 , [4] >
< 4 , 1 >
< 3 , [4 , 5] >
< 4 , 1/2 >
< 5 , 1/2 >
< 4 , [1, 2, 3] >
< 1 , 1/3 >
< 2 , 1/3 >
< 3 , 1/3 >
< 5 , [3] >
< 3 , 1 >
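A mapper along these lines might look as follows (a Python sketch; map_page is a made-up name, and the input is assumed to be the arrow-separated lines above). For each page it re-emits the adjacency list, which the next iteration needs, plus one contribution PR(p)/outdegree(p) per out-link; PR is 1 for every page in this first pass, matching the listing above:

```python
def map_page(line, pr=1.0):
    """Emit (key, value) pairs for one serialized input line."""
    src, _, rest = line.partition("->")
    src = int(src)
    targets = [int(t) for t in rest.split(",")]
    pairs = [(src, targets)]           # pass the link structure along
    share = pr / len(targets)          # contribution per out-link
    pairs += [(t, share) for t in targets]
    return pairs

# map_page("3 -> 4 , 5") returns [(3, [4, 5]), (4, 0.5), (5, 0.5)]
```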
Mapper outputs will be grouped by key and handed to the reducers. With 5 reducers this looks like:
R1 takes [4] , 1/3 then computes 1/5*(1/3) = 2/30
R2 takes [4] , 1/3 then computes 1/5*(1/3) = 2/30
R3 takes [4, 5] , 1/3 , 1 then computes 1/5*(1/3 + 1) = 8/30
R4 takes [1, 2, 3] , 1 , 1 , 1/2 then computes 1/5*( 1 + 1 + 1/2) = 15/30
R5 takes [3] , 1/2 then computes 1/5*(1/2) = 3/30
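The reduce step can be sketched the same way (an illustrative sketch: reduce_page is a made-up name, and the grouped values are assumed to mix the page's adjacency list with numeric contributions, exactly as listed above). The leading 1/5 is the initial uniform rank, factored out just as in the hand computation:

```python
def reduce_page(key, values, init_rank=1/5):
    """Sum a page's rank contributions and pass its link list through."""
    links = []
    total = 0.0
    for v in values:
        if isinstance(v, list):   # the adjacency list, passed through
            links = v
        else:                     # a rank contribution from some in-link
            total += v
    return key, links, init_rank * total

# reduce_page(4, [[1, 2, 3], 1, 1, 0.5]) returns (4, [1, 2, 3], 0.5), i.e. 15/30
```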
Now the first (power) iteration is over. In the following iterations the mappers emit the same pairs as before, except that each contribution uses the PR value just computed instead of 1 (so the reducers now simply sum, without the extra 1/5 factor):
< 1 , [4] >
< 4 , 2/30 >
< 2 , [4] >
< 4 , 2/30 >
< 3 , [4 , 5] >
< 4 , 4/30 >
< 5 , 4/30 >
< 4 , [1, 2, 3] >
< 1 , 5/30 >
< 2 , 5/30 >
< 3 , 5/30 >
< 5 , [3] >
< 3 , 3/30 >
Repeat these map-reduce iterations until the ranks converge enough, or until you are satisfied.
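The whole loop can be condensed into a single-process sketch. One caveat: the hand computation above uses no damping, and because this particular graph is bipartite the undamped iteration oscillates rather than converges, so this sketch adds the usual damping factor beta (an addition not in the walkthrough above; all names are illustrative):

```python
def pagerank(links, beta=0.85, iters=200, tol=1e-10):
    """Iterate rank updates over an out-link dict until they settle."""
    n = len(links)
    pr = {p: 1 / n for p in links}                 # uniform start: 1/5 each here
    for _ in range(iters):
        new = {p: (1 - beta) / n for p in links}   # teleportation mass
        for p, targets in links.items():           # the "map" phase
            share = beta * pr[p] / len(targets)
            for t in targets:
                new[t] += share                    # the "reduce" phase sums
        done = max(abs(new[p] - pr[p]) for p in links) < tol
        pr = new
        if done:
            break
    return pr

graph = {1: [4], 2: [4], 3: [4, 5], 4: [1, 2, 3], 5: [3]}
ranks = pagerank(graph)   # page 4 ends up with the largest rank
```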