I have an application that takes some really big delimited files (~10 to 15 million records) and ingests them into Kafka after some preprocessing. As part of this preprocessing we convert the delimited records into JSON and add metadata to each JSON message (file name, row number). We are doing this with the Json4s native serializer, like below:
import org.json4s.native.Serialization._
// some more code, and below is the final output
write(Map(
  "schema" -> schemaName,
  "data" -> List(resultMap),
  "flag" -> "I"
))
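For context, a self-contained version of that call looks roughly like this; the `schemaName` and `resultMap` values here are hypothetical stand-ins, and Json4s' `write` needs an implicit `Formats` in scope:

import org.json4s.{DefaultFormats, Formats}
import org.json4s.native.Serialization.write

object WriteSketch extends App {
  // Serialization.write requires an implicit Formats instance
  implicit val formats: Formats = DefaultFormats

  // hypothetical stand-ins for the real schema name and a parsed record
  val schemaName = "SchemaName"
  val resultMap = Map("col1" -> "value1", "col2" -> "value2")

  val json = write(Map(
    "schema" -> schemaName,
    "data" -> List(resultMap),
    "flag" -> "I"
  ))

  // prints something like:
  // {"schema":"SchemaName","data":[{"col1":"value1","col2":"value2"}],"flag":"I"}
  println(json)
}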
Once the message is converted to JSON, we add the message metadata like this:
def addMetadata(msg: String, metadata: MessageMetadata): String = {
  // serialize the metadata object to a JSON string
  val meta = write(asJsonObject(metadata))
  // drop the surrounding braces from both JSON strings
  val strippedMeta = meta.substring(1, meta.length - 1)
  val strippedMessage = msg.substring(1, msg.lastIndexOf("}"))
  // splice the metadata fields into the message object
  "{" + strippedMessage + "," + strippedMeta + "}"
}
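For comparison, here is a minimal sketch of doing the same splice at the Json4s AST level instead of by string surgery; it assumes `msg` parses to a JSON object, and the `MessageMetadata` fields are guessed from the sample output below:

import org.json4s._
import org.json4s.native.JsonMethods.{parse, render, compact}

// hypothetical shape of the metadata, guessed from the sample output
case class MessageMetadata(srcType: String, fileName: String, line: Long)

object MergeSketch extends App {
  implicit val formats: Formats = DefaultFormats

  // build the final message at the AST level instead of splicing strings
  def addMetadata(msg: String, metadata: MessageMetadata): String = {
    val metaField: JValue = JObject("metadata" -> Extraction.decompose(metadata))
    // merge appends the "metadata" field to the parsed message object
    compact(render(parse(msg) merge metaField))
  }

  val msg = """{"schema":"SchemaName","data":[],"flag":"I"}"""
  println(addMetadata(msg, MessageMetadata("file", "file", 1021)))
  // {"schema":"SchemaName","data":[],"flag":"I","metadata":{"srcType":"file","fileName":"file","line":1021}}
}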
The final message looks like this at the end:
{"schema":"SchemaName"
"data": [
],
"flag": "I",
"metadata":{"srcType":"file","fileName":"file","line":1021}}
Now both of these methods are leaking memory and throwing the error below. The application can process 300k messages per minute, but after around 4-5 minutes it slows down and eventually dies. I know string concatenation generates a lot of garbage objects, so what is the best way of doing this?
java.lang.OutOfMemoryError: GC overhead limit exceeded