4

I have a file of around 4-5 GB (nearly a billion lines). From every line of the file, I have to parse an array of integers plus one additional integer of meta information and update my custom data structure. My class to hold this information looks like

class Holder {
    private int[][] arr = new int[1000000000][5]; // assuming that max array size is 5
    private int[] meta = new int[1000000000];
}

A sample line from the file looks like

(1_23_4_55)    99

Every index in arr and meta corresponds to the line number in the file. From the above line, I extract the array of integers first and then the meta value. For that line:

--pseudo_code--
arr[line_num] = new int[]{1, 23, 4, 55}
meta[line_num] = 99

Right now, I am using a BufferedReader and its readLine method to read each line, and character-level operations to parse the integer array and the meta value from each line and populate the Holder instance. But this entire operation takes almost half an hour.

I have tried both Java Serialization and Externalizable (writing meta and arr) to serialize and deserialize this HUGE Holder instance. With both of them, serialization takes almost half an hour, and so does deserialization.
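
For reference, a minimal sketch of the Externalizable variant (the field layout here is illustrative, not the exact class I use; it writes lengths and raw ints rather than relying on default serialization):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Holder implements Externalizable {
    private int[][] arr;   // one int[] per line
    private int[] meta;    // one meta value per line

    public Holder() { }    // Externalizable requires a public no-arg constructor

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(meta.length);
        for (int i = 0; i < meta.length; i++) {
            out.writeInt(meta[i]);
            out.writeInt(arr[i].length);
            for (int v : arr[i]) {
                out.writeInt(v);
            }
        }
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        int n = in.readInt();
        meta = new int[n];
        arr = new int[n][];
        for (int i = 0; i < n; i++) {
            meta[i] = in.readInt();
            int len = in.readInt();
            arr[i] = new int[len];
            for (int j = 0; j < len; j++) {
                arr[i][j] = in.readInt();
            }
        }
    }
}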

I would appreciate suggestions on dealing with this kind of problem, and would definitely love to hear how you have handled similar cases.

P.S. Main memory is not a problem: I have almost 50 GB of RAM on my machine. I have also increased the BufferedReader buffer size to 40 MB (and could increase it to 100 MB, given that disk access runs at roughly 100 MB/s). CPU cores are not a problem either.

EDIT I

The code that I am using to do this task is provided below (after anonymizing a few details):

public class BigFileParser {

    private int parsePositiveInt(final String s) {
        // Digits are accumulated as a negative number and the sign is applied at
        // the end (the usual trick that also copes with Integer.MIN_VALUE).
        int num = 0;
        int sign = -1;
        final int len = s.length();
        final char ch = s.charAt(0);
        if (ch == '-')
            sign = 1;
        else
            num = '0' - ch;

        int i = 1;
        while (i < len)
            num = num * 10 + '0' - s.charAt(i++);

        return sign * num;
    }

    private void loadBigFile() {
        long startTime = System.nanoTime();
        Holder holder = new Holder();
        String line;
        try {

            Reader fReader = new FileReader("/path/to/BIG/file");
            // 40960-char buffer (i.e. 40 K chars, not 40 MB)
            BufferedReader bufferedReader = new BufferedReader(fReader, 40960);
            String tempTerm;
            int i, meta, ascii, len;
            boolean consumeNextInteger;
            // GNU Trove primitive int array list
            TIntArrayList arr;
            char c;
            while ((line = bufferedReader.readLine()) != null) {
                consumeNextInteger = true;
                tempTerm = "";
                arr = new TIntArrayList(5);
                for (i = 0, len = line.length(); i < len; i++) {
                    c = line.charAt(i);
                    ascii = c - 0; // the char's code point
                    // 95 is the ASCII value of the '_' char
                    if (consumeNextInteger && ascii == 95) {
                        arr.add(parsePositiveInt(tempTerm));
                        tempTerm = "";
                    } else if (ascii >= 48 && ascii <= 57) { // '0' - '9'
                        tempTerm += c;
                    } else if (ascii == 9) { // '\t'
                        arr.add(parsePositiveInt(tempTerm));
                        consumeNextInteger = false;
                        tempTerm = "";
                    }
                }

                // whatever follows the tab is the meta value
                meta = parsePositiveInt(tempTerm);
                holder.update(arr, meta);
            }
            bufferedReader.close();
            long endTime = System.nanoTime();
            System.out.println("@time -> " + (endTime - startTime) * 1.0
                    / 1000000000 + " seconds");
        } catch (IOException exp) {
            exp.printStackTrace();
        }
    }
}

public class Holder {
    private static final int SIZE = 500000000;

    private TIntArrayList[] arrs;
    private TIntArrayList metas;
    private int idx;

    public Holder() {
        arrs = new TIntArrayList[SIZE];
        metas = new TIntArrayList(SIZE);
        idx = 0;
    }

    public void update(TIntArrayList arr, int meta) {
        arrs[idx] = arr;
        metas.add(meta);
        idx++;
    }
}
consumer
    Serialization isn't much of an improvement, if any, over the original file. You should be using a database for this amount of data. You certainly should *not* be attempting to represent the entire file in memory. – user207421 Apr 18 '14 at 10:08
  • @EJP I need to use those integer arrays and associated metadata information later on in my program. I presume, using DB for storing & accessing later on would slow down the system. Performance during runtime is very important in my case. I can't test my changes fast because this whole loading (deserialization) process takes too much time. – consumer Apr 18 '14 at 10:12
  • The problem here is certainly not in BufferedReader.readLine(). You can read millions of lines a second with that. It's in your other code. And there is certainly nothing significantly faster. – user207421 Apr 18 '14 at 10:19
  • Did you try to multithread? I.e. create a few threads, each starting to read from a specific line number and skipping over the lines read by the remaining threads. – suman j Apr 18 '14 at 12:50
  • @Jack Hmm, I have not tried multithreading yet. Thanks for pointing that out. :) – consumer Apr 18 '14 at 15:02
  • @Jack But I also think that synchronization overhead (for the `Holder` members) will result in slow performance even in the case of a multi-threaded read. – consumer Apr 18 '14 at 15:11
  • @Consumer They need not be synchronized. The program can use a different Holder instance for each thread and combine them at the end. It depends on the requirement, though. – suman j Apr 18 '14 at 15:16

4 Answers

2

It sounds like the time taken for file I/O is the main limiting factor, given that serialization (binary format) and your own custom format take about the same time.

Therefore, the best thing you can do is to reduce the size of the file. If your numbers are generally small, then you could get a huge boost from using Google protocol buffers, which will encode small integers generally in one or two bytes.
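
To make the size argument concrete, here is a hand-rolled sketch of the base-128 varint scheme that protocol buffers use for small integers (this illustrates the encoding idea only, not the protobuf API itself):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class Varint {
    // Each byte carries 7 bits of the value; the high bit says "more bytes follow".
    // Values < 128 take one byte, values < 16384 take two. Non-negative ints only.
    static void write(OutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Assumes well-formed input; EOF handling is omitted in this sketch.
    static int read(InputStream in) throws IOException {
        int result = 0, shift = 0, b;
        while (((b = in.read()) & 0x80) != 0) {
            result |= (b & 0x7F) << shift;
            shift += 7;
        }
        return result | (b << shift);
    }
}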

Or, if you know that all your numbers are in the 0-255 range, you could use a byte[] rather than int[] and cut the size (and hence load time) to a quarter of what it is now (assuming you go back to serialization, or just write to a ByteChannel).
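
A rough sketch of the ByteChannel route (class and method names here are illustrative): one bulk write of the raw bytes and one bulk read back, with no object serialization in between.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ByteDump {
    // Only applicable if every value fits in a byte (0-255), as noted above.
    static void dump(byte[] data, Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));   // one bulk write, no per-value overhead
        }
    }

    static byte[] load(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Sketch assumes the file fits in a single array (< 2 GB).
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) != -1) {
                // keep filling until the buffer is full or EOF
            }
            return buf.array();
        }
    }
}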

Russell Zahniser
  • My numbers range from 0 to around 4 million, so byte[] won't help me either. But I will definitely look into the points you have mentioned. – consumer Apr 18 '14 at 15:05
  • @consumer: I don't think that it can be IO, writing 24 GB just can't take that long. So I don't think any compression can help. Either there's some swapping, or a GC problem, or very bad memory locality, or something like this. – maaartinus Apr 18 '14 at 15:09
1

If you randomly pause it you will probably see that the bulk of the time goes into parsing the integers, and/or all the new-ing, as in new int[]{1, 23, 4, 55}. You should be able to just allocate the memory once and stick numbers into it at better than I/O speed if you code it carefully.

But there's another way - why is the file in ASCII? If it were in binary, you could just slurp it up.
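
For example, if the file were raw 32-bit ints in a known layout, a sketch of that "slurp" could look like this (the layout is an assumption; the answer doesn't prescribe one):

import java.io.IOException;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class BinarySlurp {
    // Assumes the file holds raw 32-bit ints in big-endian (ByteBuffer default) order.
    // A single map is limited to 2 GB, so a 20+ GB file would need several maps.
    static int[] slurpInts(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            IntBuffer ints = mapped.asIntBuffer();
            int[] out = new int[ints.remaining()];
            ints.get(out);   // one bulk copy instead of a billion parse calls
            return out;
        }
    }
}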

Mike Dunlavey
  • This seems unlikely - parsing an int should be much faster than I/O, and the poster says that serialization (binary format) is just as slow (presumably because the above format takes around 16 bytes per line and serialization takes 20 bytes for 5 ints). – Russell Zahniser Apr 18 '14 at 12:56
  • @Russell: Problem is, he's doing about 10^9 `new`s. Crazy :) Plus, I wouldn't trust serialization to be as fast as pure binary read, because it views itself as building data structure (lots of `new`s). It's designed to save you the trouble of writing serialization yourself, not to be blindingly fast on ginormous files. – Mike Dunlavey Apr 18 '14 at 15:55
1

It simply can't take that long. You're working with some 6e9 ints, which means 24 GB. Writing 24 GB to the disk takes some time, but nothing like half an hour.

I'd put all the data in a single one-dimensional array and access it via methods like int getArr(int row, int col) which transform row and col into a single index. Depending on how the array gets accessed (usually row-wise or usually column-wise), this index would be computed as N * row + col or N * col + row to maximize locality. I'd also store meta in the same array.

Writing a single huge int[] should be pretty fast, surely not half an hour.

Because of the data amount, the above doesn't work as-is, since you can't have an array with 6e9 entries. But you can use a couple of big arrays instead and all of the above applies (compute a long index from row and col and split it into two ints for accessing the chunked 2D structure), as sketched below.
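
A sketch of that chunked flat-array idea, assuming a fixed row width (the question's rows vary in length, so the row width and chunk size here are illustrative choices):

// One logical 2D structure stored as a few large int[] chunks.
final class FlatHolder {
    static final int ROW_WIDTH = 6;        // 5 values + 1 meta per row (assumption)
    static final int CHUNK = 1 << 28;      // ints per backing array (~1 GiB each)

    final int[][] chunks;

    FlatHolder(long rows) {
        long total = rows * ROW_WIDTH;
        int n = (int) ((total + CHUNK - 1) / CHUNK);
        chunks = new int[n][];
        for (int i = 0; i < n; i++) {
            long remaining = total - (long) i * CHUNK;
            chunks[i] = new int[(int) Math.min(CHUNK, remaining)];
        }
    }

    // Row-major layout: good locality when a whole row is read at once.
    int get(long row, int col) {
        long flat = row * ROW_WIDTH + col;
        return chunks[(int) (flat / CHUNK)][(int) (flat % CHUNK)];
    }

    void set(long row, int col, int value) {
        long flat = row * ROW_WIDTH + col;
        chunks[(int) (flat / CHUNK)][(int) (flat % CHUNK)] = value;
    }
}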

Make sure you aren't swapping. Swapping is the most probable reason for the slow speed I can think of.

maaartinus
  • The sizes of `arr` across lines are not uniform, so I won't be able to put them easily in a single 1D array; I would have to encode extra size information for every `arr`. Even though this is a slight memory overhead (if a length is stored for every line, overhead = num_lines * 4 bytes), I see a brighter side in plain int-array serialization/deserialization. I have also added the source code that I am using right now, just in case. :) – consumer Apr 19 '14 at 23:15
  • @consumer: I can't see any big problem. You're creating unnecessary strings in `tempTerm` (a `StringBuilder` would do), but they are small, so it can't be that bad. Actually, you could write a method for reading the number from a given offset (so you'd need no temporary allocation besides `readLine`), but IMHO this isn't important. – maaartinus Apr 19 '14 at 23:31
  • Please make sure you're not swapping. Also check that you're not using much more memory than you think. How does the `TIntArrayList` grow? Does it have a shrink method? I'd suggest to really use `int[][] arrs` instead... it won't grow anymore and there's no reason in wasting memory for another object (`int[5]` takes 20 bytes, `TIntArrayList` with 5 elements may take easily twice as much). – maaartinus Apr 19 '14 at 23:35
  • That's almost exactly the code I have used, except for some variable names and format changes. Swapping is not used at any place whatsoever. TIntArrayList is an array-backed list of int primitives, so besides an int array it only stores a little metadata (still not much of a memory headache, I suppose). Creating a TIntArrayList object takes time, but with that many lines in the file I have not checked its impact yet. TIntArrayList creation overhead would be the last thing I would want to hear. :D – consumer Apr 19 '14 at 23:45
  • *"Swapping is not used at any place whatsoever."* - Just to be sure, I'm concerned if your OS doesn't swap. Check it e.g. via [top](http://unixhelp.ed.ac.uk/CGI/man-cgi?top) and also look how much CPU the process uses. I wasn't concerned with the `TIntArrayList` creation time, but with its memory consumption (see above). Creating a billion of them is no problem, storing a billion of objects taking twice as much memory may well be. I'd use them for reading as it's comfortable, but I'd extract the content in an `int[]` as it won't grow anymore. – maaartinus Apr 20 '14 at 00:03
  • I just used a 1D array instead of a 2D array as you suggested. Now deserialization takes just 13 seconds (very surprising). I forgot to note the exact time for serialization, but that was also somewhere near a minute or so. So it appears that using a 2D array impacts serialization/deserialization performance (I still don't know the exact reason why). Thank you. – consumer Apr 26 '14 at 14:13
1

There are several alternative Java file I/O libraries. This article is a little old, but it gives an overview that's still generally valid. He's reading about 300 MB per second with a 6-year-old Mac. So for 4 GB you have under 15 seconds of read time. Of course my experience is that Mac I/O channels are very good. YMMV if you have a cheap PC.

Note there is no advantage above a buffer size of 4K or so. In fact you're more likely to cause thrashing with a big buffer, so don't do that.

The implication is that parsing characters into the data you need is the bottleneck.

I have found in other apps that reading into a block of bytes and writing C-like code to extract what I need goes faster than the built-in Java mechanisms like split and regular expressions.
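
For this question's line format that could look something like the following sketch (it assumes each block ends on a line boundary; carry-over state between blocks is omitted, and TIntArrayList is GNU Trove, as in the question):

// Walk a block of bytes, e.g. lines like "(1_23_4_55)\t99", accumulating digits
// in place. No Strings, no split, no regex.
static void parseBlock(byte[] buf, int len, Holder holder) {
    int value = 0;
    TIntArrayList row = new TIntArrayList(5);
    for (int i = 0; i < len; i++) {
        byte b = buf[i];
        if (b >= '0' && b <= '9') {
            value = value * 10 + (b - '0');      // accumulate the current number
        } else if (b == '_' || b == '\t') {
            row.add(value);                      // number finished
            value = 0;
        } else if (b == '\n') {
            holder.update(row, value);           // last number on the line is the meta
            row = new TIntArrayList(5);
            value = 0;
        }
        // '(' and ')' are simply skipped.
    }
}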

If that still isn't fast enough, you'd have to fall back to a native C extension.

Gene
  • Thanks for pointing to the wonderful link. I have also tried reading a big block of data at once and using primitive types in Java to extract the integer arrays and meta data from each line. But even that didn't improve performance by a significant margin. Writing a native C extension might be the last resort if there's no other way to deal with this kind of problem. – consumer Apr 18 '14 at 15:00
  • @consumer: A buffer bigger than a few kB means it doesn't fit in L2. Once I benchmarked it and found out that 8 kB is usually optimal; a factor of 2 or 4 bigger or smaller is no problem, and much bigger makes no sense. Concerning C I strongly disagree, you can't win much. While C might sometimes be some 20% faster, there's the JNI overhead which surely eats more. Note that the JVM started with a lot of native methods and many of them were moved to Java as the JIT improved. – maaartinus Apr 18 '14 at 15:27
  • @maaartinus I don't see how L2 cache matters much when the system ought to be writing once into the buffer and reading once to convert. Cache matters when data are touched many times before replacement. Efficient buffer size tends to match disk block size. Something is very much out of whack. If 15 seconds is sufficient to read the data, the parsing code must be spectacularly inefficient for the total run time to be 30 minutes. – Gene Apr 19 '14 at 01:26
  • @Gene: I'm not claiming that the large buffer could be the cause of the OP's problem, just that they are slower than properly sized ones and that the reason might be the cache (but this is just a guess). In my measurements insanely huge buffers were slower by maybe 50%, not more. So forget it; we surely agree that there's something terribly wrong in the OP's case. – maaartinus Apr 19 '14 at 11:57
  • @consumer As we have been discussing, if you post your parsing code somewhere, we can look at it. There must be something broken there. Either it's that or your machine is a real dog in spite of the large RAM. – Gene Apr 19 '14 at 16:03
  • @Gene Well I have just added the source code too. Thank you. – consumer Apr 19 '14 at 23:09
  • @consumer: If nothing helps, you could create a self-contained runnable example. It should generate the datafile (all zeros should do) and read it afterwards. With a possibility for everybody to try it out, a solution is guaranteed. – maaartinus Apr 20 '14 at 01:39