Java String Memory Performance Improvement for Text File Parser

Question

I'm trying to optimize my memory usage when dealing with many (millions) of strings. I have a file of ~1.5 million lines that I'm reading through that contains decimal numbers in columns. For example, a file may look like

16916576.643 4 -12312674.246 4        39.785 4  16916584.123 3  -5937726.325 3
    36.794 3
16399226.418 6  -4129008.232 6        43.280 6  16399225.374 4  -1891751.787 4
    39.885 4
12415561.671 9 -33057782.339 9        52.412 9  12415567.518 8 -25595925.487 8
    49.950 8
15523362.628 5 -12597312.619 5        40.579 5  15523369.553 5  -9739990.371 5
    42.003 5
12369614.129 8 -28797729.913 8        50.068 8         0.000           0.000  
     0.000  
....

Currently I'm using String.split("\\s+") to separate these numbers, then calling Double.parseDouble() on each one of the parts of the String[], which looks something like:

String[] data = line.split("\\s+");
double firstValue = Double.parseDouble(data[0]);
double secondValue = Double.parseDouble(data[1]);
double thirdValue = Double.parseDouble(data[2]);

This ends up creating a lot of String objects. I also may have whitespace at the beginning or end of the line so I have to call trim() on the line before I split it, which also creates another String object. The garbage collector disposes of these String objects, which results in slowdowns. Are there more memory efficient constructs in Java to do this? I was thinking about using a char[] instead of a String but I'm not sure whether there would be a substantial improvement from that.

What kind of problems are you encountering with the current implementation? — biziclop, Jul 23 '15 at 15:23
Are you reading the entire file into memory? You should read one line at a time, parse it, and write it to a database. — Sualeh Fatehi, Jul 23 '15 at 15:25
@SualehFatehi no I'm storing the whole file in memory. I'm reading line-by-line and parsing. — David M, Jul 23 '15 at 15:27
@biziclop it's using 100MB of RAM to parse and it's creating about 700,000 `String` objects — David M, Jul 23 '15 at 15:28
@DavidM You could consider reading the file using BufferedReader. It wont load the entire file at once, but reads it line by line. Would this be a problem in the application? — Some guy, Jul 23 '15 at 15:29
@Raedwald yes but the garbage collector is called a lot which results in slowdowns. This is on Android so creating that many objects fills the heap space I'm given, gc gets called, fills heap space, gets gc, ... — David M, Jul 23 '15 at 15:32
@MarkW as my last comment, it's filling the available heap space on Android multiple times. That's the main concern — David M, Jul 23 '15 at 15:33
@DavidM Okay, so the problem is the GC slowdown. Now we're getting somewhere. :) — biziclop, Jul 23 '15 at 15:33
If youre concerned about the GC time, you can tweak the GC cpu time ratio using JVM args. You can also consider using a StringBuffer instead of the strings, which I believe is backed by a character array, and would be more memory efficient. — Mark W, Jul 23 '15 at 15:35
but Double.parseDouble() requires a String param, therefore when not using Strings you would have to implement this for yourself. — wero, Jul 23 '15 at 15:38
If very hard pressed, I'd probably implement a character-by-character parser, which does the double parsing in one step. If you know you're going to have base 10 numbers and no scientific notation, it is relatively simple. But I'd try to reduce other overhead first, there's no need for the `trim()` for example. — biziclop, Jul 23 '15 at 15:40
@biziclop I just ran DDMS to get a traceview. `parseDouble` is taking 38% of total execution time and `split` takes 15% of my execution time. Those are my biggest chunks — David M, Jul 23 '15 at 15:46
You could have one character array and then just reuse it. I'm not quite getting the file format as there are some integers too. — Bubletan, Jul 23 '15 at 15:46
@DavidM I also wanted to say that, depending on how exactly you are doing this, and user requirements... it might be more beneficial to offload this work as an async operation, and throttle it back to keep GC overhead lower. that coupled with a more aggressive GC strategy (set via the JVM args) may have the desired effect, if you fail to find any better solutions. — Mark W, Jul 23 '15 at 15:48
@MarkW I'm doing all this processing then displaying a plot to the user so they're waiting at essentially a loading screen until it finishes parsing everything. — David M, Jul 23 '15 at 15:49

OldCurmudgeon · Accepted Answer · 2015-07-24T08:26:59.047

If you are really certain that this is a severe bottleneck you could always parse your string directly into Doubles.

// Keep track of my state.
private static class AsDoublesState {

    // null means no number waiting for add.
    Double d = null;
    // null means not seen '.' yet.
    Double div = null;
    // Is it negative.
    boolean neg = false;

    void consume(List<Double> doubles, char ch) {
        // Digit?
        if ('0' <= ch && ch <= '9') {
            double v = ch - '0';
            if (d == null) {
                d = v;
            } else {
                d = d * 10 + v;
            }
            // Count digits after the dot.
            if (div != null) {
                div *= 10;
            }
        } else if (ch == '.') {
            // Decimal point - start tracking how much to divide by.
            div = 1.0;
        } else if (ch == '-') {
            // Negate!
            neg = true;
        } else {
            // Everything else completes the number.
            if (d != null) {
                if (neg) {
                    d = -d;
                }
                if (div != null) {
                    doubles.add(d / div);
                } else {
                    doubles.add(d);
                }
                // Clear down.
                d = null;
                div = null;
                neg = false;
            }
        }
    }
}

private static List<Double> asDoubles(String s) {
    // Grow the list.
    List<Double> doubles = new ArrayList<>();
    // Track my state.
    AsDoublesState state = new AsDoublesState();

    for (int i = 0; i < s.length(); i++) {
        state.consume(doubles, s.charAt(i));
    }
    // Pretend a space on the end to flush an remaining number.
    state.consume(doubles, ' ');
    return doubles;
}

public void test() {
    String s = "16916576.643 4 -12312674.246 4        39.785 4  16916584.123 3  -5937726.325 3    36.794 3";
    List<Double> doubles = asDoubles(s);
    System.out.println(doubles);
}

This will break badly if given bad data. E.g. 123--56...392.86 would be a perfectly valid number to it, and 6.0221413e+23 would be two numbers.

Here's an improved State using AtomicDouble to avoid creating all of those Double objects`.

// Keep track of my state.
private static class AsDoublesState {

    // Mutable double value.
    AtomicDouble d = new AtomicDouble();
    // Mutable double value.
    AtomicDouble div = new AtomicDouble();
    // Is there a number.
    boolean number = false;
    // Is it negative.
    boolean negative = false;

    void consume(List<Double> doubles, char ch) {
        // Digit?
        if ('0' <= ch && ch <= '9') {
            double v = ch - '0';
            d.set(d.get() * 10 + v);
            number = true;
            // Count digits after the dot.
            div.set(div.get() * 10);
        } else if (ch == '.') {
            // Decimal point - start tracking how much to divide by.
            div.set(1.0);
        } else if (ch == '-') {
            // Negate!
            negative = true;
        } else {
            // Everything else completes the number.
            if (number) {
                double v = d.get();
                if (negative) {
                    v = -v;
                }
                if (div.get() != 0) {
                    v = v / div.get();
                }
                doubles.add(v);
                // Clear down.
                d.set(0);
                div.set(0);
                number = false;
                negative = false;
            }
        }
    }
}

And I wouldn't try it close to the maximum/minimum range of `double` either as it probably accumulates error badly. But if all you do with the numbers is plot a graph with them, I wouldn't worry too much about it. — biziclop, Jul 23 '15 at 16:09
@biziclop - Very important point! Thanks for mentioning it. We could accumulate into a `long` or `BigInteger` and do just one divide at the end to side-step most error accumulations, or even use a `BigDecimal`. — OldCurmudgeon, Jul 23 '15 at 16:16
@biziclop There's no roundoff error as long as the number with omitted decimal point fits into 53 bits. All inputs in the questions have at most 12 digits, so they do. Using `long` would be surely better. — maaartinus, Jul 23 '15 at 16:29
You're generating much more garbage than the OP does. There's a new `double` for each input character, and after seeing the decimal point, there are two of them! — maaartinus, Jul 23 '15 at 16:31
@maaartinus Yes, it should use primitive types and an `enum` to keep track of where the parsing is (rather than relying on null checks). — biziclop, Jul 23 '15 at 16:49
I'm marking this as the answer because I based my own implementation on it. My version took parseDouble from 38% of execution time down to 5% and lowered my runtime 40 seconds. — David M, Jul 23 '15 at 17:10
@maaartinus - Agreed - I would probably use a mutable such as `AtomicDouble to take it to perfection`. — OldCurmudgeon, Jul 23 '15 at 18:04
@biziclop - I agree - a *proper* state machine would probably be a better option as we could tighten up the parsing and reject stuff like `123--55`.. — OldCurmudgeon, Jul 23 '15 at 18:06

Mikel Rychliski · Answer 2 · 2015-07-23T16:02:52.523

1

Try using Pattern and Matcher to split the string with a compiled regular expression:

double[][] out = new double[2][2];
String[] data = new String[2];
data[0] = "1 2";
data[1] = "3 2";

Pattern pat = Pattern.compile("\\s*(\\d+\\.?\\d*)?\\s+?(\\d+\\.?\\d*)?\\s*");
Matcher mat = pat.matcher(data[0]);
mat.find();

out[0][0] = Double.parseDouble(mat.group(1));
out[0][1] = Double.parseDouble(mat.group(2));

mat = pat.matcher(data[1]);
mat.find();
out[1][0] = Double.parseDouble(mat.group(1));
out[1][1] = Double.parseDouble(mat.group(2));

edited Jul 23 '15 at 16:02

answered Jul 23 '15 at 15:35

Mikel Rychliski

3,455
5
22
29

This doesn't convert the values into doubles. But it eliminates the need for calling `trim()`, so it's a good step forward. – biziclop Jul 23 '15 at 15:44
Yes that's correct, should have mentioned I didn't test that code segment.. Fixed the answer – Mikel Rychliski Jul 23 '15 at 16:04
You produce one less garbage object per line using `Matcher#reset`. – maaartinus Jul 23 '15 at 16:21

score 1 · Answer 3 · edited May 23 '17 at 11:51

We have similar problem in our application where lot of strings getting created and we did few things which help us fixing the issue.

Give more memory to Java if available e.g. -Xmx2G for 2gb.
If you're on 32 bit JVM then you only assign up to 4 gb - theoratical limit. So move to 64 bit.
Profile your application

You need to do it step by step:

Start with visualvm (click here for detail) and measure how many String, Double objects are getting created, size, time etc.
Use one of the Flyweight pattern and intern string objects. Guava library has Interner. Even, you can do even double also. This will avoid duplicate and cache the object using weak references, e.g. here

Interner<String> interner = Interners.newWeakInterner(); String a = interner.intern(getStringFromCsv()); String b = interner.intern(getStringFromCsv());

Copied from here

Profile your application again.

You can also use scanner to read double from file, your code will be cleaner and scanner also cache double values as it uses Double.ValueOf.

Here's the code

File file = new File("double_file.txt");
        Scanner scan = new Scanner(file);
        while(scan.hasNextDouble()) {
            System.out.println( scan.nextDouble() );
        }
        scan.close();

You can use this code and see if there is any GAIN in the performance or not.

I'm actually doing this on Android so I'm not able to give myself more memory, unfortunately. Just did a profile using DDMS and `parseDouble` takes 38% of my execution time and `split` takes 15, my biggest chunks. — David M, Jul 23 '15 at 15:47

score 1 · Answer 4 · edited May 23 '17 at 12:21

You don't have to produce any garbage at all:

Use BufferedInputStream in order to avoid char[] and String creation. There are no non-ASCII characters, so you deal with bytes directly.
Write a parser similar to this solution, but avoiding any garbage.
Let your class provide a method like double nextDouble(), which reads characters until a next double is assembled.
If you need a line-wise processing, watch for \n (ignore \r as it's just a needless addendum to \n; lone \r used to be used as line separator long time ago).

Java String Memory Performance Improvement for Text File Parser

4 Answers4