I have a small ETL-like application which fetches data from several sources, combines it according to some API rules, and finally writes it out to a destination file.
These items have a calculated unique ID string, and because several sources and API rules get mixed together, the same destination object can be generated twice or more often. Sounds weird, but it makes sense in detail. Unfortunately, I cannot detect that before exporting.
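For context, the entities are roughly of this shape (simplified; the field names besides id are made up for illustration):

case class Entity(id: String, payload: String)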
To make sure each unique-ID object is exported only once, I thought I could just store the IDs and check against them:
import scala.collection.mutable

private val ids = new mutable.HashSet[String]

def write(entity: Entity): Unit = {
  val eID = entity.id.intern()
  ids.synchronized { // I sometimes use .par.map and call write(), hence the lock
    if (ids.contains(eID)) {
      return // already exported once, skip it
    }
    ids += eID
  }
  // .. process and export the entity
}
Now that works fine for a while, but once there are ~50,000,000 elements in that HashSet, it dramatically slows down the whole process.
I start the app with string deduplication (-XX:+UseStringDeduplication) and Xmx/Xms at 32 GiB. It only uses about 9 GiB max, so I don't know what causes the slowdown. I also set -XX:StringTableSize to astronomical sizes as well as sensibly high ones, without noticeable changes.
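For reference, the launch looks roughly like this (the exact flags are from memory; -XX:+UseStringDeduplication requires the G1 collector, and the jar name is a placeholder):

java -Xms32g -Xmx32g -XX:+UseG1GC -XX:+UseStringDeduplication -XX:StringTableSize=1000003 -jar my-etl-app.jar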
If I comment out the ids.contains and += lines, my app takes about 17 minutes; with the ID comparison enabled it takes several hours.
Any idea/clue/advice?
Is my idea of comparing bad in general, or is it the choice of HashSet? Any recommendations on what/how to debug? Using VisualVM I see about 60% of the time spent in that contains method.
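To isolate the set itself, the hot path can be reproduced in a standalone snippet like the following (a rough sketch: the ID generation and duplicate ratio are made up, and the element count is scaled down so it finishes quickly):

import scala.collection.mutable

object SetBench {
  def main(args: Array[String]): Unit = {
    val ids = new mutable.HashSet[String]
    val n = 5000000 // scaled down from the real ~50M
    val start = System.nanoTime()
    var i = 0
    while (i < n) {
      // fake IDs with ~50% duplicates, mimicking entities generated twice
      val eID = ("id-" + i % (n / 2)).intern()
      ids.synchronized {
        if (!ids.contains(eID)) ids += eID
      }
      i += 1
    }
    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"$n lookups, ${ids.size} unique, took $elapsedMs ms")
  }
}

If this stays fast while the real app crawls, the problem is presumably not the HashSet lookups themselves.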
Or might it simply be expected to take this long, because for the nth element I have n-1 comparisons?
Thanks in advance.
Scala 2.11.5