4

I am parsing a rather large (200 MB) XML file that results in a tree of objects, each defining a bunch of parameters (key=value). This data structure lives in a Tomcat webapp and is used to look up those parameters.

Months ago we discovered a heap memory issue on this server. We could solve it by interning the parameter keys and values (most of them being very redundant) which reduced the memory footprint from over 150 MB to as little as 20 MB.

Today I am revisiting the server because people are complaining about startup times. Profiling the server shows that parsing the XML with XPP3 takes 40 seconds, of which String.intern() takes more than 30 seconds.

I know this is a tradeoff, and I know I could do the interning myself. Since parsing the XML is single-threaded, a simple HashMap might do the job as well. But you know, this feels kind of odd.

Did anybody crunch the numbers to see if it's worth dropping String.intern in favor of a different solution?

So the question is: how can I get contention as low as possible for such problems?

Thanks, Stefan

Stefan Schubert-Peters
  • 5,419
  • 2
  • 20
  • 21
  • Which version of Java are you using? The latest versions support compressed strings where a byte[] is used instead of a char[]. – Peter Lawrey Jul 28 '11 at 11:58
  • Currently we are using almost the latest version of Java 6. But I just noticed this morning that not interning those strings would cost more than 300 MB. Over time this data structure has grown a lot... – Stefan Schubert-Peters Jul 29 '11 at 14:27
  • When your String gets copied from the eden space it is turned into a byte[] if possible. This can halve the size of large strings. I assume you are using the 32-bit JVM to minimise memory usage. – Peter Lawrey Jul 29 '11 at 14:30
  • No we don't. Our production Tomcats currently need 4 gigs because of caching etc. But this doesn't count for development environments. On local work stations it could be a problem if the Tomcat used 900 instead of 700 MBs... – Stefan Schubert-Peters Jul 29 '11 at 14:58
  • I have updated my answer with what is likely to be the fastest solution. I recently bought a PC for home use with 24 GB (the memory cost £153 incl tax) ;) – Peter Lawrey Jul 29 '11 at 15:06
  • I was pleased yesterday to find I had run out of memory for the first time (4 GB swapped to SSD) and had to change my program. ;) – Peter Lawrey Jul 29 '11 at 15:33

4 Answers

3

Add an extra indirection step: Have a second HashMap that keeps the keys, and look up the keys there first before inserting them in the in-memory structures. This will give you much more flexibility than String#intern().
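A minimal sketch of such a map-backed interner (class and method names are illustrative, not from the question):

```java
import java.util.HashMap;
import java.util.Map;

// A simple single-threaded replacement for String.intern():
// the first occurrence of each string is kept as the canonical
// instance, and later equal strings are mapped back to it.
final class MapInterner {
    private final Map<String, String> pool = new HashMap<String, String>();

    String intern(String s) {
        String canonical = pool.get(s);
        if (canonical == null) {
            pool.put(s, s);
            canonical = s;
        }
        return canonical;
    }

    // Unlike the JVM-wide pool behind String.intern(), this pool
    // can simply be dropped once parsing is finished.
    void clear() {
        pool.clear();
    }
}
```

Since the XML parsing is single-threaded, no synchronization is needed, and the whole pool becomes garbage-collectable after startup.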

However, if you need to parse that 200 MB XML file on every Tomcat startup, and the extra 10 seconds make people grumble (are they restarting Tomcat every so often?), that should raise a red flag (have you considered using a database, even Apache Derby, to keep the parsed data?).

Tassos Bassoukos
  • 16,017
  • 2
  • 36
  • 40
  • +1, the xml file's parsed data definitely needs to be cached, for instance along with the file's last modification date – SirDarius Jul 28 '11 at 10:11
  • I was just wondering how efficient an embedded Derby DB could be. Querying is a matter of parsing some sql then instead of just looking into some in-memory data structure. Anyway, you are right: this is something we are thinking about for some time as well, but certainly more effort. – Stefan Schubert-Peters Jul 29 '11 at 14:31
  • I am just going for the map. Only trying to figure out if a FastHashMap is still faster these days, but it looks like maps need almost no time for this million interns (less than a second). Thanks! But still I am curious why a String.intern() would be so inefficient. Maybe it just was not built for such high throughput? – Stefan Schubert-Peters Jul 29 '11 at 14:33
  • I sort of lean with the "who cares". 200MB of XML is a large data set and the I/O time it takes to read it is probably big enough as it is. – Archimedes Trajano Nov 02 '11 at 05:31
  • String.intern is much slower than the costs of reading those 200MBs. With modern HDDs (at least the stuff I am used to work with) reading 200MBs should not need 30s :-) – Stefan Schubert-Peters Jan 04 '13 at 12:10
1

It appears that String.intern() doesn't scale very well as you add more and more Strings; it appears to be O(n) in the number of Strings in the pool.

Random rand = new Random();
for(int i=0;i<100;i++) {
    long start = System.nanoTime();
    for(int j=0;j<100000;j++)
        Long.toString(rand.nextLong()).intern();
    long time = System.nanoTime() - start;
    System.out.printf("Took %,d ns on average to intern() a random string%n", time/100000);
}

prints

Took 1,586 ns on average to intern() a random string
Took 3,843 ns on average to intern() a random string
Took 7,551 ns on average to intern() a random string
Took 13,436 ns on average to intern() a random string
Took 20,226 ns on average to intern() a random string
Took 27,609 ns on average to intern() a random string
Took 35,098 ns on average to intern() a random string
Took 42,439 ns on average to intern() a random string
Took 50,801 ns on average to intern() a random string
Took 20,975 ns on average to intern() a random string
Took 4,634 ns on average to intern() a random string
Took 10,512 ns on average to intern() a random string
Took 16,914 ns on average to intern() a random string
Took 23,601 ns on average to intern() a random string
Took 30,230 ns on average to intern() a random string
Took 36,184 ns on average to intern() a random string
Took 43,266 ns on average to intern() a random string

Instead I use an array as a string pool.

private static void testHashArray(String[] strings2, int size) {
    String[] pool = new String[size];
    int hit=0, miss=0;
    long start2 = System.nanoTime();
    for (String s : strings2) {
        int hash = (s.hashCode() & 0x7fffffff) % pool.length;
        String s2 = pool[hash];
        if (s.equals(s2)) {
            hit++;
        } else {
            miss++;
        }
        if (s2 != s)
            pool[hash] = s;
    }
    long time2 = System.nanoTime() - start2;
    System.out.printf("Hash size: %,d took %.3f second. Hit/miss %,d/%,d %n", size, time2 / 1e9, hit, miss);
}

public static void main(String... args) {
    Random rand = new Random();

    // a million unique strings.
    String[] strings = new String[1000 * 1000];
    for (int i = 0; i < strings.length; i++)
        strings[i] = String.valueOf(rand.nextLong());
    // random selection of Strings
    String[] strings2 = new String[10 * 1000 * 1000];
    int totalSize = 0;
    for (int i = 0; i < strings2.length; i++) {
        int idx = (int) Math.pow(strings.length, rand.nextFloat());
        String s = strings[idx];
        strings2[i] = s;
        totalSize += s.length() + 16; // with overhead
    }
    System.out.printf("Original size %,d%n", totalSize);

    Set<String> uniqueStrings = Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
    uniqueStrings.addAll(Arrays.asList(strings2));
    System.out.printf("Unique strings %,d%n", uniqueStrings.size());

    long start = System.nanoTime();
    HashMap<String, String> map = new HashMap<String, String>();
    for(String s: strings2)
        map.put(s,s);
    long time = System.nanoTime() - start;
    System.out.printf("Took %.3f second to map strings%n", time/1e9);

    testHashArray(strings2, 10192);
    testHashArray(strings2, 101929);
    testHashArray(strings2, 1019291);
}

prints

Original size 353,293,201
Unique strings 766,222
Took 0.979 second to map strings
Hash size: 10,192 took 0.357 second. Hit/miss 5,213,210/4,786,790 
Hash size: 101,929 took 0.309 second. Hit/miss 7,202,094/2,797,906 
Hash size: 1,019,291 took 0.254 second. Hit/miss 8,789,382/1,210,618 

If doing the intern is slow, how about performing it after the load, in a background thread? After the server is loaded, you can intern() the strings whenever a duplicate is found.
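One way to sketch that deferred deduplication (illustrative names; assumes the parsed parameters end up in a map you can walk, and that a real implementation would still need to think about visibility between threads):

```java
import java.util.HashMap;
import java.util.Map;

// Parse first with plain (duplicated) strings, then deduplicate
// the values in a background thread once the server is up.
final class DeferredDeduper implements Runnable {
    private final Map<String, String> params; // the parsed key=value data

    DeferredDeduper(Map<String, String> params) {
        this.params = params;
    }

    @Override
    public void run() {
        Map<String, String> pool = new HashMap<String, String>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            String v = e.getValue();
            String canonical = pool.get(v);
            if (canonical == null) {
                pool.put(v, v); // first occurrence becomes canonical
            } else {
                e.setValue(canonical); // duplicate replaced, old copy becomes garbage
            }
        }
    }
}
```

Started with e.g. `new Thread(new DeferredDeduper(params)).start();` after the webapp is serving; the duplicates are then freed gradually instead of paying the cost at startup.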

Do you really need to save 130 MB? I know it sounds great but would the memory be used for something else anyway?

If you want a faster form of intern(), you can use a fixed-size array.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • The internal data structure is some sort of a sitemap figuring out from wich systems which snippet is pulled. So it would be nice if it was there at startup. Some months ago we thought about doing it earlier but in parallel, because I/O is not the bottleneck. So thanks for the suggestion! – Stefan Schubert-Peters Jul 29 '11 at 14:36
  • Thanks for the analysis (you seemed to add later?) - anyway a very sad performance for such a useful feature. I'm going with a map now. The good part is that I can throw the map away afterwards. It saves the time and the memory footprint is good as well. – Stefan Schubert-Peters Jul 29 '11 at 15:02
  • On my machine, using an array was 3-4x faster than a map. ;) – Peter Lawrey Jul 29 '11 at 15:10
  • Certainly cool and I appreciate that. But now I/O is the bottleneck so this optimization would add no value – Stefan Schubert-Peters Jul 29 '11 at 15:20
0

We had a problem with a String being parsed into a verified 'Name' object. This was done all over the place in the application and needed to be optimized for both memory and speed.

After a few test runs we eventually ended up with a solution processing char arrays, both during parsing and in the implementation of Name.

We used String.toCharArray() to retrieve the characters of a string (one can also use String.charAt(pos)). For quick copying between arrays we used System.arraycopy.

The parsing was actually quicker than using a cache for lookup.
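For illustration, the kind of char-array handling described above might look like this (a sketch with hypothetical names, not the actual Name implementation):

```java
// Extract a region of characters from a string without creating
// intermediate substring objects; System.arraycopy does the bulk copy.
final class CharArrayUtil {
    static char[] slice(String s, int from, int to) {
        // Note: toCharArray() returns a fresh copy, it does NOT
        // expose the String's internal array.
        char[] src = s.toCharArray();
        char[] dst = new char[to - from];
        System.arraycopy(src, from, dst, 0, to - from);
        return dst;
    }
}
```

The resulting char[] can then be parsed and validated in place, avoiding the per-substring object overhead.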

Tomas F
  • 7,226
  • 6
  • 27
  • 36
  • 1
    Note that `String.toCharArray()` does **not** give access to the internal array, it creates a new copy each time it is called. The internal array is not directly accessible and can be shared between multiple `String` instances. – Joachim Sauer Jul 28 '11 at 10:14
  • Ok this is really a huge optimization for such a rather 'simple' problem. I think I would try the Derby DB proposal first before I tried to increase complexity so much. It already is too complex... – Stefan Schubert-Peters Jul 29 '11 at 14:39
0

Here's another thought, though it may sound a bit kooky: have you thought of just writing a code generator that parses your XML file and spits out Java code which populates a map using actual string literals (those get interned automatically)?

Something like this

public final class ConfigurationData {
  public static String get(String key) {
    return MAP.get(key);
  }
  private static final Map<String,String> MAP;
  static {
    MAP = new HashMap<String,String>([[[ number of records to load up ]]]);
    MAP.put([[[key 1]]], [[[ value 1 ]]]);
    MAP.put([[[key 2]]], [[[ value 2 ]]]);
    ...
  }
}

This follows the same concept as precompiled JSPs to save on the first-use penalty, but it adds another build step, and a configuration file change becomes a redeployment (which should be controlled anyway).

Archimedes Trajano
  • 35,625
  • 19
  • 175
  • 265
  • While this sounds like a nice thing to experiment with I think the most pragmatic approach is the one I picked. What I did not add as a constraint, though: The XML file is parsed many times as long as the application runs and its content changes. So your proposal would impose extra effort with on-the-fly class loading in a running Tomcat webapp :-) – Stefan Schubert-Peters Jan 27 '12 at 09:57