
I have an application which takes some really big delimited files (~10 to 15 M records) and ingests them into Kafka after doing some preprocessing. As part of this preprocessing we convert the delimited records into JSON and add metadata to each JSON message (file name, row number). We are doing it using the Json4s native serializer like below:

import org.json4s.native.Serialization._
// some more code; below is the final output.
write(Map(
  "schema" -> schemaName,
  "data"   -> List(resultMap),
  "flag"   -> "I"
))

Once the message is converted to JSON, we add the message metadata like this:

def addMetadata(msg: String, metadata: MessageMetadata): String = {
  val meta = write(asJsonObject(metadata))
  val strippedMeta = meta.substring(1, meta.length - 1)
  val strippedMessage = msg.substring(1, msg.lastIndexOf("}"))
  "{" + strippedMessage + "," + strippedMeta + "}"
}
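For reference, the same splice can be done in a single pass with a StringBuilder, taking the already-serialized metadata object as a plain string. The name `addMetadataSb` and its parameters are illustrative, not part of the original code:

```scala
// Illustrative one-pass variant: splice a serialized metadata object
// into the message just before its closing brace, using one
// StringBuilder instead of repeated string concatenation.
def addMetadataSb(msg: String, metaJson: String): String = {
  val end = msg.lastIndexOf("}") // closing brace of the message object
  val sb = new StringBuilder(msg.length + metaJson.length + 16)
  sb.append(msg.substring(0, end)) // message without its closing brace
  sb.append(",\"metadata\":")
  sb.append(metaJson) // metaJson is assumed to be a JSON object string
  sb.append("}")
  sb.toString
}
```

For example, `addMetadataSb("""{"flag":"I"}""", """{"line":1021}""")` yields `{"flag":"I","metadata":{"line":1021}}`.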

The final message looks like this at the end:

{"schema": "SchemaName",
  "data": [
  ],
  "flag": "I",
  "metadata": {"srcType": "file", "fileName": "file", "line": 1021}}

Now both of these methods are leaking memory and throwing the error below. The application can process 300k messages per minute, but after around 4-5 minutes it slows down and eventually dies. I know string concatenation generates lots of garbage objects and want to know the best way of doing this.

java.lang.OutOfMemoryError: GC overhead limit exceeded

Explorer
    Even if string concatenation allocates a few unnecessary character buffers here and there, it shouldn't *leak* anything at the end of the day. If your application bleeds to death, the reason might be elsewhere. You might find [this](https://medium.com/@dkomanov/scala-string-interpolation-performance-21dc85e83afd) at least entertaining... – Andrey Tyukin Mar 21 '18 at 20:44

2 Answers


When producing tons of such short messages, there will be tons of tiny short-lived objects created. Tiny short-lived objects are exactly what the GC can handle very efficiently; it's very improbable that they could cause any serious problems.

The message

java.lang.OutOfMemoryError: GC overhead limit exceeded

means that the GC was working very hard without any success. That's not what happens with tiny short-lived objects. Most probably, you have a big memory leak which takes away all of your memory after a few minutes. Then the GC has to fail, as there's nothing to reclaim.

Don't waste time optimizing something which may be harmless. Use a tool to find the leak instead.
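As one way to do that (not from the answer itself), the JVM can be told to dump the heap when it runs out of memory, so the leaking objects can be inspected offline in a tool such as Eclipse MAT or VisualVM. The heap size, dump path, and jar name below are illustrative:

```shell
# Capture a heap dump on OutOfMemoryError for offline analysis.
# Paths and the jar name are placeholders for this application.
java -Xmx4g \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/tmp/app-heap.hprof \
     -jar ingest-app.jar
```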

maaartinus
  • I found the object that was causing memory leak, I was using TrieMap as a Cache for some data and it is blowing up once the volume gets increased over 3M. – Explorer Mar 22 '18 at 15:01
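Since the culprit turned out to be an unbounded `TrieMap` cache, one option (a sketch, not from the thread) is a size-bounded LRU cache built on `java.util.LinkedHashMap` in access order, so memory stays flat as volume grows. The class name `BoundedCache` and the eviction limit are illustrative; note this version is not thread-safe, unlike `TrieMap`, so concurrent use would need synchronization or a caching library:

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Illustrative size-bounded LRU cache: a LinkedHashMap in access order
// evicts the least-recently-used entry once maxEntries is exceeded.
class BoundedCache[K, V](maxEntries: Int) {
  private val underlying = new JLinkedHashMap[K, V](16, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > maxEntries
  }
  def get(key: K): Option[V] = Option(underlying.get(key))
  def put(key: K, value: V): Unit = underlying.put(key, value)
  def size: Int = underlying.size()
}
```

For example, a `BoundedCache[String, Int](3)` holding four inserted keys keeps only the three most recently used.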

Try using StringBuilder; you can avoid creating unnecessary objects.

Is string concatenation in scala as costly as it is in Java?
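To illustrate the difference the answer is pointing at (the function names here are made up for the example): repeated `+=` on a `String` copies the whole accumulated string on every append, which is quadratic overall, while a single `StringBuilder` grows one buffer:

```scala
// Each += allocates a brand-new String containing a copy of
// everything accumulated so far: O(n^2) total work.
def joinWithConcat(parts: Seq[String]): String = {
  var out = ""
  for (p <- parts) out += p
  out
}

// A single StringBuilder appends into one growing buffer:
// amortized O(1) per appended character.
def joinWithBuilder(parts: Seq[String]): String = {
  val sb = new StringBuilder
  for (p <- parts) sb.append(p)
  sb.toString
}
```

Both produce the same result; only the allocation behavior differs.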

cvdr
    @Explorer, I'm not familiar with Scala, but in Java string concatenation is costly, and most languages behave the same way. The problem is that appending to a string with += constructs a new string, so it costs time linear in the length of your strings (the sum of both). https://stackoverflow.com/questions/1532461/stringbuilder-vs-string-concatenation-in-tostring-in-java – cvdr Mar 21 '18 at 21:07