3

After reading theory of PageRank algorithm from this site I would like to play with it. I am trying to implement this in Java. I mean I would like to play with PageRank in detail (like giving different weights and so on). For this I need to build hyperlink matrix. If I have 1 million nodes then my hyperlink matrix will be 1 million x 1 million size, which causes this exception:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at WebGraph.main(WebGraph.java:6)

How can I implement PageRank in Java, is there any way of storing hyperlink matrix?

torayeff
  • 9,296
  • 19
  • 69
  • 103
  • Where have you already looked? Have you found any non-open source implementations? Have you considered implementing it yourself? Do you have any preferences for language? – acattle Nov 04 '12 at 05:40
  • @acattle I have looked at Jung and WebLA. I would like to focus on theory rather than in implementation. Language preferences: any. – torayeff Nov 04 '12 at 05:46
  • Have you tried increasing your heap size to get rid of that exception? – Dan W Nov 12 '12 at 17:00
  • @DanW How can I do that? – torayeff Nov 12 '12 at 17:11
  • you need a library which is good at storing sparse matrices Matlab has such a library (in c++ I believe) which you could borrow, I believe, and link it in your java code. Increasing the heap size is just a bandage solution unless you deal with small graphs/matrices. Maybe use this in java http://introcs.cs.princeton.edu/java/44st/SparseMatrix.java.html – Adrian Nov 13 '12 at 16:49
  • Voting to close as too broad. – Ciro Santilli OurBigBook.com Sep 10 '15 at 10:20

7 Answers7

8

That is a great article to learn about pagerank. I implemented a Perl version from it here to use with Textrank. However, if you want to just learn about pagerank and how the various aspects discussed in the article affect the results (dampening factor, direct or undirected graph, etc.), I would recommend running experiments in R or Octave. If you want to learn how to implement it efficiently, then programming it up from scratch, as you are doing, is best.

Most web graphs (or networks) are very sparse, which means most of the entries in the matrix representation of the graph are zero. A common data structure used to represent a sparse matrix is a hash-map, where the zero values are not stored. For example, if the matrix was

1, 0, 0
0, 0, 2,
0, 3, 0

a two dimension hash-map would store only the values for hm(0,0)=1, hm(1,2)=2, and hm(2,1)=3. So in a 1,000,000 by 1,000,000 matrix of a web graph, I would expect only a few million values to be non-zero. If each row averages only 5 non-zero values, a hash-map will use about 5*(8+8+8)10^6 bytes ~ 115mb to store it (8 for the left int index, 8 for the right int index, and 8 for the double value). The square matrix will use 810^6*10^6 ~ 7 terabytes.

Implementing an efficient sparse matrix-vector multiply in Java is not trivial, and there are some already implemented if you don't want to devote time to that aspect of the algorithm. The sparse-matrix multiply is the most difficult aspect to implement of the pagerank algorithm, so after that it gets easier (and more interesting).

Glorfindel
  • 21,988
  • 13
  • 81
  • 109
Jeff Kubina
  • 800
  • 4
  • 15
4

Python networkx module has a nice implementation of pagerank. It uses scipy/numpy for the matrix implementation. The below two questions on stackoverflow should be enough to get you started.

Community
  • 1
  • 1
greeness
  • 15,956
  • 5
  • 50
  • 80
2

A few suggestions:

  • Use python, not Java: python is an excellent prototyping language, and has available sparse matrices (in scipy) as well as many other goodies. As others have noted, it also has a pagerank implementation.

  • Store your data not all in memory: any type of lightweight database would be fine, for instance sqlite, hibernate, ...

  • Work on tiles of the data: if there is a big matrix NxN, break it up into small tiles MxM where M is a fraction of N, that fit in memory. Combined with sparse matrices this allows you to work with really big N (hundreds of millions to billions, depending on how sparse the data is).

Alex I
  • 19,689
  • 9
  • 86
  • 158
0

As Dan W suggested, try to increase the heap size. If you run your Java application from the command line, just add the switch -Xmx with the desired heap size. Let's assume you compiled your Java code into a runnable JAR file called pagerank.jar, and you want to set your heap size to 512 MB, you would issue the following command:

java -jar -Xmx512m pagerank.jar

EDIT: But that only works if you don't have that many "pages" ... A 1 Million x 1 Million array is too big to fit into your RAM (1 trillion times * 64 bit double value = 7.27595761 terabytes). You should change your algorithm to load chunks of data from the disk, manipulate it, and store it back to disk.

You could use a graph database like Neo4j for that purpose.

das_weezul
  • 6,082
  • 2
  • 28
  • 33
  • I have WebGraph.class, so I ran: java -Xmx2048m WebGraph, but it still gives OutOfMemoryError – torayeff Nov 12 '12 at 17:28
  • @torayeff: See my edited answer. 1 Million * 1 Million nodes = 1 Trillion cells. If you use a double array (a double is probably 64 bit) that would be more than 7 terabytes, which is probably more than your RAM can handle. – das_weezul Nov 12 '12 at 17:49
0

PageRank is performed by Google using the 'Pregel' BSP (really just keywords) framework.

I remembered Apache Giraph (another Pregel), which includes a version of PageRank in its benchmark package.

Here's a video about Giraph: it's an introduction, and it specifically talks about handling PageRank.

If that doesn't work:

In Java there is an implementation of Pregel called GoldenOrb.

Pseudo code for the PageRank algorithm is here (on a different implementation of Pregel).

You'll have to read around BSP, and PageRank to handle the size of data you have.

Tom
  • 830
  • 1
  • 5
  • 13
0

You don't have to store the whole 1000000x1000000 matrix, because most matrix entries will be zero. Instead, you can (for example) store a list of nonzero entries for each row, and write your matrix functions to use it directly, without expanding it into a full matrix.

This kind of compressed representation is called a sparse matrix format, and most matrix libraries have an option to build and work with sparse matrices.

One disadvantage with sparse matrices is that multiplying two of them will result in a matrix which is much less sparse. However, the PageRank algorithm is designed so that you don't need to do that: the hyperlink matrix is constant, and only the score vector is updated.

comingstorm
  • 25,557
  • 3
  • 43
  • 67
0

Because the matrix is sparse you can implement dimensionality reduction like svd,pca,mds or Lsi that includes svd. There is a library to implement this kind of processes which is called Jama. You can find it here

Lefteris Bab
  • 787
  • 9
  • 19