
Sorry if this has been asked before (though I can't really find a solution).

I'm not much of a programmer, but here is the situation: I am crawling a bunch of websites and storing information about them on a server. I need a Java program to process the vector coordinates associated with each of the documents (about a billion or so documents, each with roughly 500,000 numbers associated with it). I need to calculate the singular value decomposition (SVD) of that whole matrix.

Now Java, to my knowledge, obviously can't handle a matrix that big. If I try making even a relatively small array (about 44 million elements), I get a heap error. I use Eclipse, so I tried changing the -Xmx value to 1024m (it won't go any higher for some reason, even though I have a computer with 8 GB of RAM).

What solution is there to this? Another way of retrieving the data I need? Calculating the SVD in a different way? Using a different programming language to do this?

EDIT: Just for right now, pretend there are a billion entries with 3 words associated with each. I am setting Xmx and Xms correctly (from Run Configurations in Eclipse -> this is equivalent to running java -XmsXXXX -XmxXXXX ...... at the command prompt).
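A quick way to confirm which heap limit the program actually received, however the flags are passed, is to print Runtime.getRuntime().maxMemory(); the class below is only an illustrative sketch, not part of the original question:

```java
// Prints the maximum heap the running JVM will attempt to use, so you can
// confirm that the -Xmx value from the Eclipse Run Configuration took effect.
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB (" + maxBytes + " bytes)");
    }
}
```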

Sidd Singal
    Java can most definitely handle that. We have JVMs that grow to 32 gigs in size. You might be passing the -Xmx argument incorrectly, or something else is happening. – Andrew T Finnell Aug 06 '12 at 17:41
  • How are you setting the Xmx? Don't mix up the Xmx setting for running Eclipse with the Xmx setting for the actual program, they're two different things. – pcalcao Aug 06 '12 at 17:42
  • A billion documents? With 500000 numbers associated with each document... No, you're not going to fit that in 8GB of RAM. At four bytes per `int`, that comes out to 1.7 petabytes. Come back when you have a data center with hundreds of computers. – Louis Wasserman Aug 06 '12 at 17:43
  • You may find that reconsidering your approach to the situation will do you better. If possible, incrementally calculate it. – FThompson Aug 06 '12 at 17:43
  • @pcalcao sorry I changed them both. Still need more space regardless. – Sidd Singal Aug 06 '12 at 17:47
  • @LouisWasserman That is correct! I can find a way to compress the data, but the dimensions of the array will still have to be at least the number of documents. – Sidd Singal Aug 06 '12 at 17:50
  • @Sidd: you really need to reconsider your design. Depending on what exactly you need to do, you may also need to reconsider your expectations. Even with an adjacency list, you need at least 12GB to store the *links* between 1,000,000,000 documents with 3 links each on average. Why do you think Google has so much hardware? – thkala Aug 06 '12 at 18:10
  • The place I am working at has the hardware I need to store that much information. Getting the amount of RAM I need for this though seems to be the problem. After reading all of these comments and answers, I think the best solution is to find an incremental way of calculating the SVD like I need, rather than to store the whole matrix at once and calculate the SVD as such. – Sidd Singal Aug 06 '12 at 18:14

6 Answers


The Java heap size can be set with the -Xmx (note the initial capital X) option, and it can certainly reach far more than 1 GB, provided you are using a 64-bit JVM and the corresponding physical memory is available. You should try something along the lines of:

java -Xmx6144m ...

That said, you need to reconsider your design. There is a significant space cost associated with each object, with a typical minimum somewhere around 12 to 16 bytes per object, depending on your JVM. For example, a String has an overhead of about 36-40 bytes...

Even with a single object per document with no book-keeping overhead (impossible!), you just do not have the memory for 1 billion (1,000,000,000) documents. Even for a single int per document you need about 4 GB.

You should re-design your application to make use of any sparseness in the matrix, and possibly to make use of disk-based storage when possible. Having everything in memory is nice, but not always possible...
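As an illustration of the sparseness idea (not from the original answer; the class and method names are made up), a document row could store only its non-zero coordinates instead of a dense 500,000-element array:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a sparse row: only non-zero coordinates are stored,
// keyed by their column index, instead of a dense double[500_000] per document.
public class SparseRow {
    private final Map<Integer, Double> values = new HashMap<>();

    public void set(int column, double value) {
        if (value != 0.0) {
            values.put(column, value);
        }
    }

    public double get(int column) {
        return values.getOrDefault(column, 0.0);
    }

    public int nonZeroCount() {
        return values.size();
    }
}
```

Whether this helps depends entirely on how sparse the vectors really are; if most of the 500,000 values per document are non-zero, disk-based or distributed storage is unavoidable.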

thkala
  • This also probably requires a 64-bit PC/JVM, yes? I think the max heap size on 32-bit would be a little under 2GB. – Carl Aug 06 '12 at 17:45
  • He mentions 1bn documents with 500k records each, that's 500 trillion records. Even if each of them only uses one byte, it won't work... – assylias Aug 06 '12 at 17:46
  • @Carl: Actually, I think the max heap space for 32 bit is around 4GB (2^32) – Chris Dargis Aug 06 '12 at 17:48
  • @DougRamsey: not on Windows 32-bit, it isn't - the maximum on that OS is about 1.5GB... – thkala Aug 06 '12 at 18:02
  • @DougRamsey, fair enough. I think it's also platform dependent and Windows might restrict it under 4GB. Good discussion here: http://stackoverflow.com/questions/1434779/maximum-java-heap-size-of-a-32-bit-jvm-on-a-64-bit-os Also shows a nice trick: `Runtime.getRuntime().maxMemory()` will show what you currently are able to use. – Carl Aug 06 '12 at 18:02
  • BTW `-mx` is the same as `-Xmx` except `-mx` is a standard, but poorly documented option. – Peter Lawrey Aug 06 '12 at 18:09
  • 1520435200... I can't run with any more than 1500m for Xmx. – Sidd Singal Aug 06 '12 at 18:09
  • @Sidd: What OS are you using, anyway? – thkala Aug 06 '12 at 18:12
  • @thkala I am using Windows 32-bit – Sidd Singal Aug 06 '12 at 18:16
  • @Sidd: Then half of those 8 GB of RAM in your system are completely wasted - Windows 32 bit will not use over 3.5 GB or so... – thkala Aug 06 '12 at 18:20
  • @thkala: I wonder, can any 32-bit OS address that much memory? – Chris Dargis Aug 06 '12 at 18:24
  • @thkala Hmm I did not know that. Unfortunately I am not allowed to change the OS associated with the computer I am working with. This can be something I will tinker with on my home laptop. Thanks for that advice. – Sidd Singal Aug 06 '12 at 18:24
  • @DougRamsey: Linux 32-bit can address more than 4GB using the PAE extension, *but* individual user processes are still limited by the use of 32-bit pointers and by the 1/3 (or 2/2) address space split... – thkala Aug 06 '12 at 19:10
  • @thkala: I wasn't aware of Physical Address Extensions...thanks! – Chris Dargis Aug 06 '12 at 19:15

Are you using a 32-bit JVM? A 32-bit JVM cannot have more than about 2 GB of heap, and in practice I never managed to allocate more than 1.5 GB. Use a 64-bit JVM instead, as it can allocate a much larger heap.
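As a side note (not part of the original answer), one way to check which kind of JVM a program is actually running on; the first property is HotSpot-specific:

```java
// Reports whether the running JVM is 32-bit or 64-bit.
public class JvmBitness {
    public static void main(String[] args) {
        // HotSpot-specific; prints "32" or "64" when the property is available
        System.out.println("Data model: " + System.getProperty("sun.arch.data.model"));
        // Architecture the JVM was built for, e.g. "x86" (32-bit) vs "amd64" (64-bit)
        System.out.println("Arch: " + System.getProperty("os.arch"));
    }
}
```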

Christian Semrau
  • At this point a 64-bit JVM isn't possible for me, I am only running a 32-bit computer. But I did not know about the 2 GB maximum, thanks for that advice. I guess the solution just comes down to how I handle the mathematics part of this. – Sidd Singal Aug 06 '12 at 17:55
  • @Sidd: a 32-bit computer with 8 GB of memory? Are you sure about that? Perhaps you are just running a 32-bit OS? – thkala Aug 06 '12 at 17:58

Or you could apply some math to it and use a divide-and-conquer strategy, that is, split the problem into smaller subproblems that combine into the same result.

I don't know much about SVD, but maybe this page can be helpful:

http://www.netlib.org/lapack/lug/node32.html
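For reference (not part of the original answer), the factorization being computed is $A = U \Sigma V^{T}$, where $A$ is the $m \times n$ document-by-feature matrix, the columns of $U$ and $V$ are orthonormal, and $\Sigma$ is diagonal with the singular values on its diagonal. Divide-and-conquer routines, such as those described at the LAPACK link above, split the problem into smaller pieces and merge the partial results rather than factoring the full matrix in one pass.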

Gonzalo

-Xms and -Xmx are different. The one containing s sets the starting (initial) heap size and the one with x sets the maximum heap size.

so

java -Xms512m -Xmx1024m

would give you 512 MB to start with and a 1024 MB maximum (note the m suffix; without it the values are interpreted as bytes).

As other people have said, though, you may need to break your problem down to get this to work. Are you using 32-bit or 64-bit Java?

RNJ

For data of that size, you should not plan to store it all in memory. The most common scheme to externalize this kind of data is to store it all in a database and structure your program around database queries.
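A minimal sketch of what "structure your program around database queries" could look like with plain JDBC; the connection URL, the table name doc_vectors, and the column names are all hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative only: stream document vectors out of a database one row at a
// time instead of loading everything into memory. URL, table and column
// names ("doc_vectors", "doc_id", "dim", "value") are placeholders.
public class VectorStreamer {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/crawl", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT doc_id, dim, value FROM doc_vectors ORDER BY doc_id")) {
            ps.setFetchSize(10_000); // hint the driver to stream rather than buffer everything
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    long docId = rs.getLong("doc_id");
                    int dim = rs.getInt("dim");
                    double value = rs.getDouble("value");
                    // process (docId, dim, value) incrementally here
                }
            }
        }
    }
}
```

As the comment below notes, random access over a database is slow, so this kind of scheme works best when the computation can be driven by sequential scans.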

ddyer
  • Databases and computer algebra systems do not mix very well - too many random queries. The latency of the database is generally a performance killer... – thkala Aug 06 '12 at 18:00

> Just for right now, pretend there are a billion entries with 3 words associated with each.

If you have one billion entries, you need one billion times the size of each entry. If the 3 words per entry are stored as ints, that's at least 12 GB just for the data. If you meant the words as Strings, you could enumerate the words, since there are only about 100K distinct words in English, and it would take about the same amount of space.
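A minimal sketch of the "enumerate the words" idea (the class and method names are made up for illustration): each distinct word is assigned a small integer ID once, so a document can store 3 ints instead of 3 Strings.

```java
import java.util.HashMap;
import java.util.Map;

// Maps each distinct word to a compact int ID so documents can store ints
// instead of String references. With ~100K English words the map stays small.
public class WordDictionary {
    private final Map<String, Integer> ids = new HashMap<>();

    public int idFor(String word) {
        Integer id = ids.get(word);
        if (id == null) {
            id = ids.size();   // next unused ID
            ids.put(word, id);
        }
        return id;
    }

    public int size() {
        return ids.size();
    }
}
```

Three ints per entry is 12 bytes, which is where the 12 GB lower bound for a billion entries comes from.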

Given that 16 GB of RAM costs a few hundred dollars, I would suggest buying more memory.

Peter Lawrey