I have a small ETL-like application which fetches data from several sources, combines it according to some API rules, and finally spits it out into a destination file.

These items have a calculated unique ID string, and due to the mixup of several sources and API rules, it can happen that the same destination object is generated twice or more often. Sounds weird, but makes sense in detail. Unfortunately I cannot detect that before exporting.

To have each unique-ID object exported only once, I thought I could just store their IDs and compare:

import scala.collection.mutable

private val Ids = new mutable.HashSet[String]

def write(entity: Entity): Unit = {
  val eID = entity.id.intern
  Ids.synchronized { // i sometimes use .par.map and call write()
    if (Ids.contains(eID)) {
      return
    }
    Ids += eID
  }
  // .. process
}

Now that works fine for a while, but with ~50,000,000 elements in that hash set, the whole process slows down dramatically.

I start the app with string deduplication enabled and Xmx/Xms at 32 GiB. It only uses about 9 GiB max, so I don't know what causes the slowdown. I set the StringTableSize to astronomical sizes as well as sensibly high ones, without noticeable changes.

If I comment out the Ids.contains and += lines, my app takes about 17 minutes. With the ID comparison enabled it takes several hours.

Any idea/clue/advice?

Is my idea of comparing bad in general? Or the choice of HashSet? Any recommendations what/how to debug? Using VisualVM I saw about 60% of the time spent in that contains method.

Might it just be okay that this takes so long, because for each nth element I have n-1 comparisons..?

Thanks in advance.

Scala 2.11.5

MomStopFlashing

1 Answer

I am not sure this has anything to do with the Scala HashSet. The culprit is the String#intern method, which has some serious performance issues. See Performance penalty of String.intern()

If you want interning to make sure that your computation fits in your working set, you will have to write your own intern table. Writing a good high-performance intern table can be quite tricky, so if you can afford a dependency on Guava, use a Guava Cache.
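A hand-rolled intern table can be sketched in a few lines on top of ConcurrentHashMap.putIfAbsent. The StringInterner object below is a hypothetical name, not part of Guava or the question's code; Guava's interners do essentially this with extra tuning such as weak references:

```scala
import java.util.concurrent.ConcurrentHashMap

// Minimal intern-table sketch. putIfAbsent atomically keeps the first
// instance stored for each distinct string value; intern() then always
// returns that canonical instance, so equal strings share one object.
object StringInterner {
  private val table = new ConcurrentHashMap[String, String]()

  def intern(s: String): String = {
    val canonical = table.putIfAbsent(s, s)
    if (canonical == null) s else canonical
  }
}

// equal but distinct String objects intern to the same reference
val a = StringInterner.intern(new String("id-42"))
val b = StringInterner.intern(new String("id-42"))
assert(a eq b)
```

Unlike Guava's weak interner, this table never releases its entries, so it only fits if all the IDs have to stay in memory anyway.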

Have you tried just dropping the interning completely?
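Without interning, the synchronized block can also be dropped by using a concurrent set whose add() answers the "seen before?" question atomically. A sketch, assuming Entity at least carries the calculated id string (the case class below is illustrative, not the asker's actual type):

```scala
import java.util.concurrent.ConcurrentHashMap

// assumed shape of the asker's Entity: it carries the calculated ID
final case class Entity(id: String)

// a concurrent set; add() returns false when the id was already present,
// so no explicit lock is needed even when write() is called from .par.map
val seenIds = ConcurrentHashMap.newKeySet[String]()

def write(entity: Entity): Unit = {
  if (!seenIds.add(entity.id)) return
  // .. process
}
```

Duplicate IDs are then skipped without ever calling intern or taking a monitor on the whole set.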

Rüdiger Klaehn
  • Hey. Yes, intern was an expensive try to speed it up. Actually I could not even notice a loss or gain using .intern here. To be honest, it was quite a noobish act of desperation ;) – MomStopFlashing Feb 26 '15 at 17:36