3

I am currently having severe performance issues with a natural language processing application I'm developing. Basically, given texts, it gathers various data and does a bit of number crunching.

And for every sentence, it does EXACTLY the same thing. The algorithms applied to gather the statistics do not evolve with previously read data, so the work per sentence stays the same.

The issue is that the processing time does not scale linearly at all: 1 minute for 10k sentences, 1 hour for 100k, and days for 1M...

I tried everything I could, from re-implementing basic data structures to object pooling to recycling instances. The behavior doesn't change. I get a non-linear increase in time that seems impossible to justify by a few more hashmap collisions, nor by IO waits, nor by anything else! Java just becomes sluggish as the data grows, and I feel totally helpless.

If you want an example, just try the following: count the number of occurrences of each word in a big file. Some code is shown below. Doing this takes me 3 seconds over 100k sentences and 326 seconds over 1.6M... so a factor of 110 instead of 16. As the data grows further, it just gets worse...

Here is a code sample. Note that I compare strings by reference (for efficiency reasons); this is possible thanks to the 'String.intern()' method, which returns a unique reference per string. And the map is never re-hashed during the whole process for the numbers given above.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DataGathering
{
 SimpleRefCounter<String> counts = new SimpleRefCounter<String>(1000000);

 private void makeCounts(String path) throws IOException
 {
  BufferedReader file_src = new BufferedReader(new FileReader(path));

  String line_src;

  int n = 0;
  // readLine() returns null at end of file; ready() only says whether
  // a read would block, so it is not a reliable end-of-stream test.
  while ((line_src = file_src.readLine()) != null)
  {
   n++;

   if (n % 10000 == 0)
    System.out.print(".");

   if (n % 100000 == 0)
    System.out.println("");

   String[] src_tokens = line_src.split("[ ,.;:?!'\"]");

   for (int i = 0; i < src_tokens.length; i++)
   {
    String src = src_tokens[i].intern();
    counts.bump(src);
   }

  }
  file_src.close();
 }

 public static void main(String[] args) throws IOException
 {
  String path = "some_big_file.txt";

  long timestamp = System.currentTimeMillis();

  DataGathering dg = new DataGathering();
  dg.makeCounts(path);


  long time = (System.currentTimeMillis() - timestamp) / 1000;
  System.out.println("\nElapsed time: " + time + "s.");
 }
}

public class SimpleRefCounter<K>
{
 static final double GROW_FACTOR = 2;
 static final double LOAD_FACTOR = 0.5;

 private int capacity;
 private int key_count = 0; // number of distinct keys stored
 private long total = 0;    // sum of all counts

 private Object[] keys;
 private int[] counts;


 public SimpleRefCounter()
 {
  this(1000);
 }

 public SimpleRefCounter(int capacity)
 { 
  this.capacity = capacity;
  keys = new Object[capacity];
  counts = new int[capacity];
 }



 public synchronized int increase(K key, int n)
 {
  int id = System.identityHashCode(key) % capacity;

  while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
   id = (id + 1) % capacity;


  if (keys[id] == null)
  {
   key_count++;
   keys[id] = key;

   if (key_count > LOAD_FACTOR * capacity)
   {
    resize((int) (GROW_FACTOR * capacity));
   }
  }


  counts[id] += n;

  total += n;

  return counts[id];
 }



 public synchronized void resize(int capacity)
 {
  System.out.println("Resizing counters: " + this);

  this.capacity = capacity;

  Object[] new_keys = new Object[capacity];
  int[] new_counts = new int[capacity];

  for (int i = 0; i < keys.length; i++)
  {
   Object key = keys[i];
   if (key == null) // skip empty slots
    continue;
   int count = counts[i];

   int id = System.identityHashCode(key) % capacity;

   while (new_keys[id] != null && new_keys[id] != key) // if it's occupied, let's move to the next one!
    id = (id + 1) % capacity;

   new_keys[id] = key;
   new_counts[id] = count;
  }

  this.keys = new_keys;
  this.counts = new_counts;
 }


 public int bump(K key)
 {
  return increase(key, 1);
 }

 public int get(K key)
 {
  int id = System.identityHashCode(key) % capacity;

  while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
   id = (id + 1) % capacity;


  if (keys[id] == null)
   return 0;
  else
   return counts[id];
 }
}

Any explanations? Ideas? Suggestions?

...and, as I said at the beginning, it is not about this toy example in particular but about the more general case. The same exploding behavior occurs, for no apparent reason, in the larger, more complex program.

jjnguy
  • 136,852
  • 53
  • 295
  • 323
Arnaud
  • 31
  • 2
  • How many tokens do you have per line? If you have a lot of them, you could be approaching O(n^2)... it could explain a lot of things. – Kaltezar Feb 16 '10 at 19:30
  • Without analyzing everything you did in depth: what makes you think that comparing Strings by reference is faster? See http://www.codeinstructions.com/2009/01/busting-javalangstringintern-myths.html and http://kohlerm.blogspot.com/2009/01/is-javalangstringintern-really-evil.html – flybywire Feb 16 '10 at 19:30
  • The number of tokens varies, around 10 to 40 on average. But we don't care how the time varies per token in a line; whatever it is, two batches of a thousand lines (with the same length distribution) should be processed in approximately the same time. Which is not the case. – Arnaud Feb 16 '10 at 19:44
  • As for intern(), it gave me a tremendous gain in both memory usage and execution speed, around one order of magnitude. (Without it, I ran out of memory with 64 GB of RAM ;) ) – Arnaud Feb 16 '10 at 19:46
  • Tom Hawtin gave some great advice about commenting-out the call to `counts.bump()`. I second that advice; you might learn something about hashtable implementation if you do so. – kdgregory Feb 16 '10 at 21:20
  • The more I work on it, the less conclusive it gets... But I didn't have much time today... I'll give it one more shot tomorrow. Thanks for all the hints. – Arnaud Feb 17 '10 at 16:40

9 Answers

4

Rather than feeling helpless, use a profiler! It would tell you exactly where in your code all this time is spent.
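For instance, with a Sun JDK of that era, the built-in hprof agent gives a CPU sampling profile without installing anything (the class name below is taken from the question's code; the interval and depth values are illustrative):

```shell
# Sample the CPU every 10 ms, recording stacks 10 frames deep;
# the profile is written to java.hprof.txt when the JVM exits.
java -agentlib:hprof=cpu=samples,interval=10,depth=10 DataGathering
```

The hottest methods in the report would show whether the time goes into `intern()`, the split, or the counter.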

meriton
  • 68,356
  • 14
  • 108
  • 175
  • Well... I thought of the same... but somehow high-quality open-source Java profilers for Linux seem hard to come by. I guess I'll have no choice but to try them one after another, without being at all sure I'll get the magic answer from any of them. – Arnaud Feb 16 '10 at 20:27
  • 1
    Just commenting out the `counts.bump(src);` line would give a clue. – Tom Hawtin - tackline Feb 16 '10 at 21:01
  • Eclipse has profiling tools. Netbeans has profiling tools. VisualVM ships with JDK 1.6.0_07+ (jvisualvm). – basszero Feb 17 '10 at 02:16
2

Bursting the processor cache and thrashing the Translation Lookaside Buffer (TLB) may be the problem.

For String.intern you might want to do your own single-threaded implementation.

However, I'm placing my bets on the relatively bad hash values from System.identityHashCode. It clearly isn't using the top bit, as you don't appear to get ArrayIndexOutOfBoundsExceptions. I suggest replacing that with String.hashCode.
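A minimal sketch of that change (the `slot` helper and class name are mine, not from the original code): `String.hashCode` spreads the characters' contents over all 32 bits, and the mask clears the sign bit so the index always lands in range.

```java
public class SlotDemo
{
 // Map a key to a bucket index using String.hashCode instead of
 // System.identityHashCode; '& 0x7fffffff' clears the sign bit,
 // so the result is always in [0, capacity).
 static int slot(String key, int capacity)
 {
  return (key.hashCode() & 0x7fffffff) % capacity;
 }

 public static void main(String[] args)
 {
  int capacity = 1000000;
  String[] words = { "the", "quick", "brown", "fox" };
  for (String w : words)
   System.out.println(w + " -> " + slot(w, capacity));
 }
}
```

Note that with `String.hashCode`, equal strings always land in the same bucket, even without interning; with `identityHashCode`, only the exact same object does.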

Tom Hawtin - tackline
  • 145,806
  • 30
  • 211
  • 305
2
String[] src_tokens = line_src.split("[ ,.;:?!'\"]");

Just an idea -- you are creating a new Pattern object for every line here (look at the String.split() implementation). I wonder whether this also contributes a ton of short-lived objects that need to be garbage collected.

I would create the Pattern once, probably as a static field:

final private static Pattern TOKEN_PATTERN = Pattern.compile("[ ,.;:?!'\"]");

And then change the split line to this:

String[] src_tokens = TOKEN_PATTERN.split(line_src);

Or if you don't want to create it as a static field, at least create it only once, as a local variable at the beginning of the method, before the while loop.

Charles O.
  • 2,237
  • 3
  • 17
  • 21
1

In get, when you search for a nonexistent key, search time is proportional to the size of the set of keys.

My advice: if you want a HashMap, just use a HashMap. They got it right for you.
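For comparison, here is the same counting loop over a plain `HashMap` (a sketch; pre-sizing the map to avoid rehashing is the only tuning applied, and `Integer` boxing is the main overhead versus the `int[]` array):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class HashMapCount
{
 public static Map<String, Integer> count(Iterable<String> lines)
 {
  // Pre-size well above the expected vocabulary so the map never rehashes.
  Map<String, Integer> counts = new HashMap<String, Integer>(1 << 21);
  for (String line : lines)
  {
   for (String token : line.split("[ ,.;:?!'\"]"))
   {
    if (token.isEmpty()) // split leaves empty strings between delimiters
     continue;
    Integer c = counts.get(token);
    counts.put(token, c == null ? 1 : c + 1);
   }
  }
  return counts;
 }

 public static void main(String[] args)
 {
  Map<String, Integer> c = count(Arrays.asList("the cat, the dog"));
  System.out.println(c.get("the")); // prints 2
 }
}
```

No interning is needed here, since HashMap compares keys with equals(), not by reference.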

novalis
  • 1,112
  • 6
  • 12
  • Luckily, I never search for a non-existent key. But even in that case, the average time taken is constant (just a few collisions at worst). The classic HashMap was too slow and memory-hungry for my purposes. IdentityHashMap did the job, but I made this custom implementation to be sure the performance drop did not come from it. – Arnaud Feb 16 '10 at 20:07
  • 1
    So I actually hadn't tested the code before commenting. Now I have tested it, and it seems to increase only linearly with the size of the input text. Even when I doubled the size of the vocabulary (I ran a copy of the text through rot13 and concatenated it to the original), the increase was still linear in the number of words. Do you have a very large vocabulary? I was just testing with a (multiple-times-repeated) copy of The Scarlet Letter. – novalis Feb 16 '10 at 22:23
1

You are filling up the Perm Gen with interned strings. Have you tried viewing the -Xloggc output?
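On a Sun JDK of that vintage (where `String.intern()` stores its pool in the permanent generation), flags like these would make both GC activity and perm-gen pressure visible; the sizes and class name are illustrative:

```shell
# Log every collection with per-generation sizes to gc.log, and raise
# the perm gen ceiling in case interned strings are filling it up.
java -Xloggc:gc.log -XX:+PrintGCDetails -XX:MaxPermSize=512m DataGathering
```

If the log shows ever-longer full GCs as the run progresses, the non-linear slowdown has its explanation.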

Amir Afghani
  • 37,814
  • 16
  • 84
  • 124
0

I would guess it's just memory filling up: the working set growing beyond the processor cache, memory fragmentation, and garbage-collection pauses kicking in. Have you checked memory use at all? Tried changing the heap size the JVM uses?

djc
  • 11,603
  • 5
  • 41
  • 54
  • Yeah, I gave it many GB of extra heap... the heap doesn't seem to grow much during execution. – Arnaud Feb 16 '10 at 19:27
  • ...but I have the same suspicions... the GC seems like the easy suspect, as usual ;) ...However, switching between recycling millions of objects and recycling none made no big impact, leading me to wonder whether the GC is really the cause. – Arnaud Feb 16 '10 at 19:32
0
  1. Try to do it in Python, and run the Python module from Java.
  2. Enter all the keys in the database, and then execute the following query:

    select key, count(*) from keys group by key

  3. Have you tried just iterating through the keys without doing any calculations? Is it faster? If so, go with option (2).

Omar Al Kababji
  • 1,768
  • 4
  • 16
  • 31
0

Can't you do this? You can get your answer in no time.

Community
  • 1
  • 1
Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
0

It's me, the original poster; something went wrong during registration, so I'm posting separately. I'll try the various suggestions given. PS for Tom Hawtin: thanks for the hints; perhaps 'String.intern()' takes more and more time as the vocabulary grows. I'll check that tomorrow, along with everything else.
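A rough way to check that hypothesis (a sketch; the class name is mine, and wall-clock timings are only indicative): intern a steady stream of fresh strings and watch whether the per-batch time climbs as the intern table grows.

```java
public class InternTiming
{
 public static void main(String[] args)
 {
  int batch = 100000;
  for (int round = 0; round < 10; round++)
  {
   long t0 = System.currentTimeMillis();
   for (int i = 0; i < batch; i++)
   {
    // A fresh, never-before-seen string per call, so every
    // round adds 100k entries to the intern table.
    String s = Integer.toString(round * batch + i).intern();
   }
   long dt = System.currentTimeMillis() - t0;
   System.out.println("round " + round + ": " + dt + " ms");
  }
 }
}
```

If the per-round time climbs steadily, intern() is a suspect; if it stays flat, the slowdown comes from elsewhere.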

dagnelies
  • 5,203
  • 5
  • 38
  • 56