
I read a dictionary that might be 100MB or so in size (it sometimes gets bigger, up to a maximum of 500MB). It is a simple dictionary of two columns: the first column contains words, the second a float value. I read the dictionary file in this way:

BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while ((line = br.readLine()) != null) {
    String[] cols = line.split("\t");
    setIt(cols[0], cols[1]);
}
br.close();

and for the setIt function:

public void setIt(String term, String value) {
    all.put(term, new Double(value));
}

When I have a big file, it takes a long time to load and it often runs out of memory. Even with a reasonably sized file (100MB) it needs about 4GB of heap in Java to run.

Any clue how to improve it while not changing the structure of the whole package?

EDIT: I'm using a 50MB file with -Xmx1g and I still get the error.

UPDATE: There were some extra iterations over the file that I have now fixed, and the memory problem is partially solved. I still have to try the Properties approach and the other solutions, and will report back.

Nick
  • Have you tried commenting out the while() loop? That would tell you if the problem is in your BufferedReader, or if it comes later in setIt. Might help narrow down the search. – Barry Gackle May 15 '15 at 04:15
  • When I don't use any dictionary the code runs very fast. As soon as I read a dictionary, it gives me the heap error. – Nick May 15 '15 at 04:19
  • @Nick, is this an assignment? If not you may try using Properties file and you need not loop and build map for this. You may simply use [Properties](http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html). –  May 15 '15 at 04:22
  • @Nick, what does the all.put() function do? – Barry Gackle May 15 '15 at 04:27
  • @Arvind, this is not an assignment. I have a code and try to optimize it. Wanted to figure out how can I do that. – Nick May 15 '15 at 04:28
  • @BarryGackle, it's a hash-map with all strings and their value. – Nick May 15 '15 at 04:32
  • Have you considered using some software that was designed for this sort of thing, e.g. Redis? – Chris Martin May 15 '15 at 04:39
  • `100 mb` doesn't say very much yet. You are creating objects per line from the file, and `100 mb` could be one line or 50 million lines. Creating an object has an overhead, in many cases 12-16 bytes per object (http://stackoverflow.com/questions/17335884/object-header-size-in-java-on-64bit-vm-with-4gb-ram). How many lines does the file contain? – Erwin Bolwidt May 15 '15 at 05:13
  • 2845315 lines. Each line is a word and a double value with tab separating them. – Nick May 15 '15 at 06:32
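To make the per-object overhead concrete, here is a rough back-of-envelope estimate for 2,845,315 map entries (all byte counts below are ballpark assumptions for a 64-bit JVM with compressed oops, not measured values):

long lines = 2845315L;
long perString = 40 + 2 * 10; // String object header plus backing char[] for a ~10-char word (assumed)
long perDouble = 16;          // one boxed Double per entry (assumed)
long perEntry  = 32;          // HashMap.Entry object plus its slot in the table array (assumed)
long bytes = lines * (perString + perDouble + perEntry);
System.out.println(bytes / (1024 * 1024) + " MB"); // roughly 290 MB, before any temporary garbage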

4 Answers


You are allocating a new String for every line. There is some overhead associated with a String; see here for a calculation. This article also addresses the subject of object memory use in Java.

There is a Stack Overflow question on the subject of more memory-efficient replacements for strings here.

Is there something you can do to avoid all those allocations? For example, are there a limited number of strings that you could represent as an integer in your data structure, and then use a smaller lookup table to translate? A sketch of that idea follows.
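A minimal sketch of that interning idea (the WordTable class and its method names are hypothetical, and it only pays off if many words repeat):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Maps each distinct word to a small int ID, so the rest of the data
// structure can store 4-byte ints instead of separate String objects.
class WordTable {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> words = new ArrayList<>();

    int idOf(String word) {
        Integer id = ids.get(word);
        if (id == null) {
            id = words.size();
            ids.put(word, id);
            words.add(word);
        }
        return id;
    }

    String wordOf(int id) {
        return words.get(id);
    }
}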

Barry Gackle
  • The dictionary file is only 50 MB. How can it eat up such a large memory? – Nick May 15 '15 at 04:37
  • It doesn't. As you said, you can read the file in just fine. The metadata in whatever object you are storing the data in is what is killing you. What is the size of the object you are loading the data into, and how many of those are you creating? That is the relevant number -- a few tens of bytes times a few million lines, and you have your explanation. Each string object alone might be adding 40 bytes or so. – Barry Gackle May 15 '15 at 04:42
  • Is there any way to find that out? – Nick May 15 '15 at 05:09
  • Surprisingly, no. I had expected to find a Java equivalent of the C/C++ sizeof() operator, but no luck. I did find this: https://code.google.com/p/javabi-sizeof/ – Barry Gackle May 15 '15 at 05:23

You can do a lot of things to reduce memory usage. For example:

1- replace String[] cols = line.split("\t"); with:

static final Pattern PATTERN = Pattern.compile("\t");

//...

String[] cols = PATTERN.split(line);

2- use a .properties file to store your dictionary and simply load it this way:

Properties properties = new Properties();

//...

try (FileInputStream fileInputStream = new FileInputStream("D:/dictionary.properties")) {
    properties.load(fileInputStream);
}
Map<String, Double> map = new HashMap<>();
Enumeration<?> enumeration = properties.propertyNames();
while (enumeration.hasMoreElements()){
    String key = (String) enumeration.nextElement();
    map.put(key, new Double(properties.getProperty(key)));
}

//...

dictionary.properties:

A = 1
B = 2
C = 3
# ...

3- use StringTokenizer:

StringTokenizer tokenizer = new StringTokenizer(line, "\t");
setIt(tokenizer.nextToken(), tokenizer.nextToken());
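In the same spirit, you can avoid allocating a String[] for every line by splitting on the first tab yourself (a sketch, assuming each line has exactly two tab-separated columns):

int tab = line.indexOf('\t');
if (tab >= 0) {
    setIt(line.substring(0, tab), line.substring(tab + 1));
}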
FaNaJ
  • I can see that compiling the pattern and the other approaches you suggest might speed things up, but I don't understand why they reduce memory usage (assuming the JVM's garbage collection is working correctly). Can you explain? – ᴇʟᴇvᴀтᴇ May 15 '15 at 10:07

Well, my solution will deviate a little bit from your code...

Use Lucene, or more specifically the Lucene Dictionary, or even more specifically the Lucene Spell Checker, depending on what you want.

Lucene handles large amounts of data with efficient memory usage.

Your problem is that you are storing the whole dictionary in memory. Lucene stores it in a file with hashing, and then fetches search results from the file at runtime, efficiently. This saves a lot of memory. You can customize the search depending on your needs.

Small Demo of Lucene
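For a flavor of what that looks like, a minimal sketch (assuming Lucene 5.x; the field names "term" and "value" and the index path are made up for illustration):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

// Index each dictionary entry as a tiny document on disk instead of a HashMap entry.
Directory dir = FSDirectory.open(Paths.get("dictIndex"));
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
Document doc = new Document();
doc.add(new StringField("term", "hello", Field.Store.YES));
doc.add(new StoredField("value", 0.75));
writer.addDocument(doc);
writer.close();

// Look a single term up from disk at runtime.
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
TopDocs hits = searcher.search(new TermQuery(new Term("term", "hello")), 1);
if (hits.totalHits > 0) {
    double value = searcher.doc(hits.scoreDocs[0].doc)
                           .getField("value").numericValue().doubleValue();
}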

Junaid

A few possible causes for this problem:

1) The String array cols is using up too much memory.

2) The String line might also be using too much memory, though that's unlikely.

3) While Java is opening and reading the file, it is also using memory, so that's another possibility.

4) Your map put will also be taking up a small amount of memory.

It might also be all these things combined, so maybe try commenting some lines out and see if it works then.

The most likely cause is that all these things added up are eating your memory. So a 10 megabyte file could end up taking 50 megabytes in memory. Also make sure to .close() all input streams, and try to free memory by splitting up your methods so variables go out of scope and get garbage collected; see the sketch below.
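For the stream-closing point, a minimal sketch using try-with-resources (Java 7+), which closes the reader even if an exception is thrown:

try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] cols = line.split("\t");
        setIt(cols[0], cols[1]);
    }
} // br is closed automatically here, even on exceptions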

As for doing this without changing the package structure or the Java heap size arguments, I'm not sure it will be very easy, if possible at all.

Hope this helps.

Luke Melaia