9

I have a pipe delimited file which I parse to get system options. The environment is sensitive to heap allocation and we are trying to avoid garbage collection.

Below is the code I am using to parse the pipe delimited string. This function is called about 35000 times. I was wondering if there was a better approach that doesn't create as much memory churn.

static int countFields(String s) {
    int n = 1;
    for (int i = 0; i < s.length(); i++)
        if (s.charAt(i) == '|')
            n++;

    return n;
}

static String[] splitFields(String s) {
    String[] l = new String[countFields(s)];

    for (int pos = 0, i = 0; i < l.length; i++) {
        int end = s.indexOf('|', pos);
        if (end == -1)
            end = s.length();
        l[i] = s.substring(pos, end);
        pos = end + 1;
    }

    return l;
}

EDIT 1, about java version:

For business reasons we are stuck at JDK 1.6.0_25.

EDIT 2 about String and String[] usage:

The String[] is used to perform the system setup logic. Basically, if String[0].equals("true") then enable debugging. That's the usage pattern

EDIT 3 about garbage collected objects:

The input String and the String[] are eventually GC'd. The input String is a single line from system setup file which is GC'd after the entire file is processed and the String[] is GC'd after the entire line has been processed.

EDIT - SOLUTION:

This is a combo of Peter Lawrey and zapl's solutions. Also, this class is NOT thread safe.

public class DelimitedString {

    private static final Field EMPTY = new Field("");

    private char delimiter = '|';
    private String line = null;
    private Field field = new Field();

    public DelimitedString() { }

    public DelimitedString(char delimiter) {
        this.delimiter = delimiter;
    }

    public void set(String line) {
        this.line = line;
    }

    public int length() {
        int numberOfFields = 0;
        if (line == null)
            return numberOfFields;

        int idx = line.indexOf(delimiter);
        while (idx >= 0) {
            numberOfFields++;
            idx = line.indexOf(delimiter, idx + 1);
        }
        return ++numberOfFields;
    }

    public Field get(int fieldIndex) {
        if (line == null)
            return EMPTY;

        int currentField = 0;
        int startIndex = 0;
        while (currentField < fieldIndex) {
            startIndex = line.indexOf(delimiter, startIndex);

            // not enough fields
            if (startIndex < 0)
                return EMPTY;

            startIndex++;
            currentField++;
        }

        int endIndex = line.indexOf(delimiter, startIndex);
        if (endIndex == -1)
            endIndex = line.length();

        fieldLength = endIndex - startIndex;
        if (fieldLength == 0)
            return EMPTY;

        // Populate field
        for (int i = 0; i < fieldLength; i++) {
            char c = line.charAt(startIndex + i);
            field.bytes[i] = (byte) c;
        }
        field.fieldLength = fieldLength;
        return field;
    }

    @Override
    public String toString() {
        return new String(line + " current field = " + field.toString());
    }

    public static class Field {

        // Max size of a field
        private static final int DEFAULT_SIZE = 1024;

        private byte[] bytes = null;
        private int fieldLength = Integer.MIN_VALUE;

        public Field() {
            bytes = new byte[DEFAULT_SIZE];
            fieldLength = Integer.MIN_VALUE;
        }

        public Field(byte[] bytes) {
            set(bytes);
        }

        public Field(String str) {
            set(str.getBytes());
        }

        public void set(byte[] str) {
            int len = str.length;
            bytes = new byte[len];
            for (int i = 0; i < len; i++) {
                byte b = str[i];
                bytes[i] = b;
            }
            fieldLength = len;
        }

        public char charAt(int i) {
            return (char) bytes[i];
        }

        public byte[] getBytes() {
            return bytes;
        }

        public int length() {
            return fieldLength;
        }

        public short getShort() {
            return (short) readLong();
        }

        public int getInt() {
            return (int) readLong();
        }

        public long getLong() {
            return readLong();
        }

        @Override
        public String toString() {
            return (new String(bytes, 0, fieldLength));
        }

        // Code taken from Java class Long method parseLong()
        public long readLong() {
            int radix = 10;
            long result = 0;
            boolean negative = false;
            int i = 0, len = fieldLength;
            long limit = -Long.MAX_VALUE;
            long multmin;
            int digit;

            if (len > 0) {
                char firstChar = (char) bytes[0];
                if (firstChar < '0') { // Possible leading "-"
                    if (firstChar == '-') {
                        negative = true;
                        limit = Long.MIN_VALUE;
                    } else
                        throw new NumberFormatException("Invalid leading character.");

                    if (len == 1) // Cannot have lone "-"
                        throw new NumberFormatException("Negative sign without trailing digits.");
                    i++;
                }
                multmin = limit / radix;
                while (i < len) {
                    // Accumulating negatively avoids surprises near MAX_VALUE
                    digit = Character.digit(bytes[i++], radix);
                    if (digit < 0)
                        throw new NumberFormatException("Single digit is less than zero.");
                    if (result < multmin)
                        throw new NumberFormatException("Result is less than limit.");

                    result *= radix;
                    if (result < limit + digit)
                        throw new NumberFormatException("Result is less than limit plus new digit.");

                    result -= digit;
                }
            } else {
                throw new NumberFormatException("Called readLong with a length <= 0. len=" + len);
            }
            return negative ? result : -result;
        }
    }
}
Justin
  • 4,196
  • 4
  • 24
  • 48
  • What are you using the `String[]` for? Perhaps this can be eliminated. – Peter Lawrey Dec 02 '13 at 19:44
  • What do you do with these substrings once you get them? – Sergey Kalinichenko Dec 02 '13 at 19:44
  • 4
    Well, you're avoiding the worst thing you can do, which is to repeatedly split the string into head and successively smaller tail sections. If you need N strings, though, there's no real way to avoid creating N strings. (Used to be the JVM would "share" a single char array for substrings off the original string, but that's no longer done for a number of reasons.) – Hot Licks Dec 02 '13 at 19:44
  • 3
    @user2864740 Not since Java 7 update 4. – Peter Lawrey Dec 02 '13 at 19:44
  • As first step I suggest to hard finetuning the garbage collector of your VM. Often the parallel running isn't turned on by default. – peterh Dec 02 '13 at 19:45
  • 4
    @PeterLawrey -- The shared buffer was, for all practical purposes, dropped a lot earlier than that (Java 2 timeframe). It just took a long time to get rid of the extra offset field. – Hot Licks Dec 02 '13 at 19:46
  • @user2864740 Yes, it is worth noting that Java 6 didn't have this issue so much but Java 7 does. – Peter Lawrey Dec 02 '13 at 19:46
  • @PeterLawrey We are using JDK 1.6.0_25. – Justin Dec 02 '13 at 19:47
  • @Justin Gosh, there was almost 20 updates of Java 6 since then, not that it matters for the purposes of this question. I would avoid creating the String[] at all or use an Object pool. Its not clear from the question which is best as it depends on what you are doing with the String[] – Peter Lawrey Dec 02 '13 at 19:49
  • Do you really need String[]? Is there no better way you can parse? Is speed critical, at the cost of memory? – arynaq Dec 02 '13 at 19:49
  • for better GC look at mapping the entire stream w/ '|' to memory using NIO then use grep4j to parse pull the expressions between pipes. – Robert Rowntree Dec 02 '13 at 19:50
  • Do you use your substrings sequentially, or would it be acceptable to take them one at a time? – Sergey Kalinichenko Dec 02 '13 at 19:51
  • @PeterLawrey The `String[]` is used to perform the system setup logic. Basically, if String[0].equals("true") then enable debugging. That's the usage pattern. – Justin Dec 02 '13 at 19:52
  • @arynaq Not completely sure if I need `String[]`; preventing garbage is the highest priority of the system (besides proper biz logic). – Justin Dec 02 '13 at 19:53
  • Well, if You want to minimize object count, create two arrays of integers (for option start and option length or end). Then You have Original string and two primitive arrays, total three objects. Iterate through the string twice if you want exact array lengths, that should be negligible overhead. Then create custom methods for any operations you need to perform on the substrings. – hyde Dec 02 '13 at 19:54
  • So you could write a function which tested `s.startsWith("true|")` instead? I suspect you could eliminate the `String s` as well if you wanted. – Peter Lawrey Dec 02 '13 at 19:54
  • What part is garbage collected here? The input `String`, the output `String[]`, both and how often (i.e. if you want to get `String[0]` do you recreate that array? Is that necessary, i.e. does the input change?) – zapl Dec 02 '13 at 19:55
  • Hmmm not tested this but would using jni and getting the string from c, anyone any idea how slow that would be? iirc substringing in c just returns the (requested part) of the backing char array. – arynaq Dec 02 '13 at 19:59
  • @zapl The input `String` and the `String[]` are eventually GC'd. The input `String` is a single line from system setup file which is GC'd after the entire file is processed and the `String[]` is GC'd after the entire line has been processed. – Justin Dec 02 '13 at 20:02
  • @arynaq - Using JNI would of course not need to create any Java objects .... until you wanted to pass the strings back to Java code. Basically, if you want the strings to be created at all the above code is as good as any. You necessarily have N strings, and the cost of an additional `String[]`, if it's useful, is negligible. – Hot Licks Dec 02 '13 at 20:03
  • 1
    The only way you could do much better is to analyze the values as you parse them and only cache the analyzed values (numbers, true/false, etc), perhaps into the fields of a custom object. – Hot Licks Dec 02 '13 at 20:10
  • @dasblinkenlight The order of the substring doesn't matter but getting rid of the String[] all together would be big rewrite of the logic code. Not that re-writing the logic is a huge deal but something I have to take into consideration. – Justin Dec 02 '13 at 20:14
  • @Justin Assuming that the strings that you pass have several parts, the allocation of `String[]` is small in comparison to the allocation of individual strings. – Sergey Kalinichenko Dec 02 '13 at 20:16
  • 1
    It should be noted that a `String[]` object is no more expensive (in space and number of objects) than a `long[]` object, and it's generally not worth the effort to try to avoid creating it. It's basically one object segment that's about `(4 + N) * 8` bytes long. – Hot Licks Dec 02 '13 at 20:21

8 Answers8

6

I would do something like this.

public static void main(String[] args) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader("inputfile"));
    StringBuilder sb = new StringBuilder();
    do {
        boolean flag = readBoolean(br, sb);
        long val = readLong(br, sb);
        process(flag, val);
    } while (nextLine(br));
    br.close();
}

private static void process(boolean flag, long val) {
    // do something.
}

public static boolean readBoolean(BufferedReader br, StringBuilder sb) throws IOException {
    readWord(br, sb);
    return sb.length() == 4
            && sb.charAt(0) == 't'
            && sb.charAt(1) == 'r'
            && sb.charAt(2) == 'u'
            && sb.charAt(3) == 'e';
}

public static long readLong(BufferedReader br, StringBuilder sb) throws IOException {
    readWord(br, sb);
    long val = 0;
    boolean neg = false;
    for (int i = 0; i < sb.length(); i++) {
        char ch = sb.charAt(i);
        if (ch == '-')
            neg = !neg;
        else if (ch >= '0' && ch <= '9')
            val = val * 10 + ch - '0';
        else
            throw new NumberFormatException();
    }
    return neg ? -val : val;
}

public static boolean nextLine(BufferedReader br) throws IOException {
    while (true) {
        int ch = br.read();
        if (ch < 0) return false;
        if (ch == '\n') return true;
    }
}

public static void readWord(BufferedReader br, StringBuilder sb) throws IOException {
    sb.setLength(0);
    while (true) {
        br.mark(1);
        int ch = br.read();
        switch (ch) {
            case -1:
                throw new EOFException();
            case '\n':
                br.reset();
            case '|':
                return;
            default:
                sb.append((char) ch);
        }
    }
}

This is waaaaay more complicated, but creates very little garbage. In fact the StringBuilder can be recycled. ;)

Note: this doesn't create a String or a String[]

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • Peter, thanks for the code. This is great but I ended up going with zapl's solution because it was a more direct replacement for my current code base. Although, I may revisit this if/when my memory profiling is done. – Justin Dec 06 '13 at 14:02
2

Basically, if String[0].equals("true") then enable debugging.

You can get rid of the creation of the array and substrings by comparing directly to the input string. That's not avoiding to create the input string like Peter Lawrey's solution does but could be less work to change (although I doubt that).

public static boolean fieldMatches(String line, int fieldIndex, String other) {
    int currentField = 0;
    int startIndex = 0;
    while (currentField < fieldIndex) {
        startIndex = line.indexOf('|', startIndex);

        // not enough fields
        if (startIndex < 0)
            return false;

        startIndex++;
        currentField++;
    }

    int start = startIndex;
    int end = line.indexOf('|', startIndex);
    if (end == -1) {
        end = line.length();
    }
    int fieldLength = end - start;

    // make sure both strings have the same length
    if (fieldLength != other.length())
        return false;

    // regionMatches does not allocate objects
    return line.regionMatches(start, other, 0, fieldLength);
}

public static void main(String[] args) {
    String line = "Config|true"; // from BufferedReader
    System.out.println(fieldMatches(line, 0, "Config"));
    System.out.println(fieldMatches(line, 1, "true"));
    System.out.println(fieldMatches(line, 1, "foobar"));
    System.out.println(fieldMatches(line, 2, "thereisnofield"));
}

Output

true
true
false
false
zapl
  • 63,179
  • 10
  • 123
  • 154
  • I ended up using this technique because it was less work as a direct replacement for my current code base. – Justin Dec 06 '13 at 14:00
1

Just an idea. Don't split anything. Do the opposite - append them (e.g. in some StringBuilder) and keep them in one big String or actually StringBuilder would be better. Strings in it can be separated by | and (what's currently) String arrays with say #.

Then just return indices from the method splitFields - the start index and the end index (of what's currently your String[]).

Just throwing out some idea here. Not sure of your exact use cases scenario, depends what you do with the return value you get back.

Of course, you will need to manage that big StringBuilder yourself and remove data from it when you don't need that data any more, otherwise it will eventually grow too large.

Even if this idea is not directly applicable, I hope you get my point - I think you need some pool or memory area or something like that which you would manage yourself.

peter.petrov
  • 38,363
  • 16
  • 94
  • 159
1

Since the biggest "offenders" here are the strings created as the result of parsing the input, followed by arrays of strings created once per call, and because it appears from one of your comments that you do not need all substrings at once, you could make an object that feeds you strings one at a time, reusing the same StringBuilder object as it goes.

Here is a skeletal class showing how this can be done:

class Splitter {
    private String s="";
    private int pos = 0;
    public void setString(String newS) {
        s = newS;
        pos = 0;
    }
    boolean tryGetNext(StringBuilder result) {
        result.delete(0, result.length());
        // Check if we have anything to return
        if (pos == s.length()) {
            return false;
        }
        // Go through the string starting at pos, adding characters to result
        // until you hit the pipe '|'
        // At that point stop and return
        while (...) {
            ...
        }
        return true;
    }
}

Now you can use this class as follows:

StringBuilder sb = new StringBuilder(MAX_LENGTH);
Splitter splitter = new Splitter();
for (String s: sourceOfStringsToBeSplit) {
    sb.setString(s);
    while (splitter.tryGetNext(sb)) {
        ... // Use the string from sb
    }
}

If the sizing of the internal buffer of sb is done correctly, this code will create three objects during the entire run - the Splitter, the StringBuilder, and the character array inside the StringBuilder. You need to be careful when you use StringBuilder to not create additional objects, though - specifically, you need to avoid calling toString() on it. For example, instead of comparing the StringBuilder for equality to a fixed string like this

if (sb.toString().equals(targetString)) {
    ...
}

you should write

if (targetString.length() == sb.length() && sb.indexOf(targetString) == 0) {
    ...
}
Sergey Kalinichenko
  • 714,442
  • 84
  • 1,110
  • 1,523
0

do like this

static String[] splitFields(String s) {    
    List<String> list = new ArrayList<String>();
    StringBuilder sb  = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
         char charAt = s.charAt(i);
         if (charAt == '|'){
            list.add(sb.toString());
            sb = new StringBuilder();
        }else{
           sb.append(charAt);
        }
    }
    list.add(sb.toString());//last chunk
    return list.toArray(new String[list.size()]);; 
}
Prabhakaran Ramaswamy
  • 25,706
  • 10
  • 57
  • 64
  • @HotLicks any suggestion please? – Prabhakaran Ramaswamy Dec 02 '13 at 20:07
  • You create multiple StringBuilders you don't need, an ArrayList when the OP's scheme gets away with a much cheaper string array, and the purpose of the code is not obvious. One can glance at the OP's code and get the gist of what it's trying to do, whereas yours is significantly less obvious. – Hot Licks Dec 02 '13 at 20:17
  • Ultimately you'll just need to run experiments and I see this as a valid experiment. I think in this case you should try resetting the stringbuilder with sb.setLength(0) instead of creating a new one, as it retains the backing array and would avoid the GC. – eSniff Dec 02 '13 at 20:22
  • @eSniff this http://stackoverflow.com/questions/5192512/how-to-clear-empty-java-stringbuilder reference says sb.setLength(0) is more costly process compare to creating new one. – Prabhakaran Ramaswamy Dec 02 '13 at 20:26
  • Only if you know the size you're storing in the SB, as is mentioned in the answer with the most votes. And the comment was regarding cpu performance, not heap overhead. But, you should experiment not assume SF is always right. – eSniff Dec 02 '13 at 20:36
0

Why reinvent the wheel? This is far simpler and doesn't require you count the substrings first.

http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split%28java.lang.String%29

Dodd10x
  • 3,344
  • 1
  • 18
  • 27
0

Why not something like: http://javacsv.sourceforge.net/com/csvreader/CsvReader.html

Is this data purely pipe delimited text or is it something more specific like HL7 or ER7 which warrants using a specific parser?

Freiheit
  • 8,408
  • 6
  • 59
  • 101
0

The answers above seem to not answer your question.

They're all suggesting you use String.split() - and I'll admit, that's what I recommend too.

But that's not what you're wondering. You want something with less memory churn. And the answer to that is no, there is not. In fact, what you're doing is more efficient than split. (It special cases single-character regexes, but uses a List and then exports the list as a String[].) - If your goal is indeed to reduce memory footprint, you do not want to use split.

However, what I don't see is if you've identified GC concerns as actually being a problem. Modern JVMs are very good at cleaning up short-lived objects. So if you're trying to plan ahead, don't - just use split. If you've identified the GC as a problem already and are trying to find a solution, this solution is the best you will get (unless you implement your own String object that keeps the same backing character array like Java used to do.)

You could pass in a char[] and return the number of characters actually put into that array for your substrings. This way you only ever have one buffer. This assumes you're processing the tokens you get back from this function one at a time. You also could pass in a MutableString object that you create yourself. Getting rid of the String[] will greatly reduce burden on the GC.

Of course, now you have to think about things like buffer overflow, and other nonsense like that. Things that are taken care of for you by the system.

corsiKa
  • 81,495
  • 25
  • 153
  • 204