Situation:
I'm writing a Kafka producer that fetches JSON data (in large, multi-megabyte chunks) from a web request.
I need to check this data for a date field and grab the largest value.
Next I chop the JSON array up into smaller JSON objects ("rows" of data) and serialize them as Avro (generic) records.
While my application works, it uses quite a lot of memory for something rather lightweight. I suspect the JSON parsing is the culprit.
Or rather, I'm the one not writing proper code.
Question:
How can I lower my memory footprint? (It can spike over 1 GB until GC comes and saves the day.) I was thinking of "finding" every JSON object and doing an operation per object, instead of reading in the whole thing. However, I'm not inclined to write a whole codebase for this, as this just needs to work generically on any JSON. Rolling my own code just to find JSON objects would be too error-prone whenever edge cases arise.
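Roughly what I have in mind is a streaming pass over the array. A minimal sketch, assuming Jackson (jackson-core/jackson-databind) is on the classpath; foreachJsonObject is a made-up helper name:

import java.io.InputStream
import com.fasterxml.jackson.core.JsonToken
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

// Sketch: walk a top-level JSON array element by element, so only one
// object is materialized at a time instead of the whole payload.
def foreachJsonObject(in: InputStream)(f: JsonNode => Unit): Unit = {
  val mapper = new ObjectMapper()
  val parser = mapper.getFactory.createParser(in)
  try {
    if (parser.nextToken() != JsonToken.START_ARRAY)
      throw new IllegalStateException("expected a top-level JSON array")
    // nextToken() lands on START_OBJECT for each element; readValueAsTree
    // consumes exactly that one object and returns it as a tree.
    while (parser.nextToken() == JsonToken.START_OBJECT) {
      f(parser.readValueAsTree[JsonNode]())
    }
  } finally parser.close()
}

That would keep things generic ("it's just a JSON array of objects") without hand-rolling a tokenizer.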
The fetch code:
def get(url: String, headers: List[String]): String = {
  val httpEntity = try {
    getRequest(url, headers)
  } catch {
    ....
  }
  if (httpEntity == null) return ""
  val inputStream = httpEntity.getContent
  try {
    // NOTE: this reads the entire response body into a single String in memory
    scala.io.Source.fromInputStream(inputStream, Codec.UTF8.name).getLines.mkString
  } catch {
    case e: Exception =>
      logger.error("can't fetch/parse data from http stream.", e)
      throw e
  } finally {
    inputStream.close()
  }
}
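If the streaming idea pans out, I imagine get wouldn't build a String at all but would hand the stream onwards. A sketch reusing the existing getRequest helper (getStream is a made-up name; the caller would own closing the stream):

import java.io.InputStream

def getStream(url: String, headers: List[String]): Option[InputStream] = {
  val httpEntity = getRequest(url, headers)
  // the caller is responsible for closing the returned stream
  Option(httpEntity).map(_.getContent)
}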
This is called here:
val stringData = someclass.get(url, headers)
if (!stringData.trim.equals("[]"))
  parseJson(stringData, "some key", "date found in records", new SimpleDateFormat("some yyyy/dd stuff here"))
The parsing code:
private def parseJson(string: String, keyName: String, dateField: String,
    format: SimpleDateFormat): (Date, Array[(String, String)]) = {
  val arr = new JSONArray(string)
  val kvList = new ArrayBuffer[(String, String)]
  logger.debug(s"${arr.length} records found, will loop over json objects")
  if (arr.length > 0) {
    logger.info(s"parsing ${arr.length} records")
    for (i <- 0 until arr.length) {
      val obj = arr.getJSONObject(i) // avoid fetching the same object twice
      kvList.append((obj.getString(keyName), obj.toString))
    }
    // this is where I go and get the date field I wanted
    (extractJsonDate.getMaxDate(arr, dateField, format), kvList.toArray)
  } else {
    logger.info("didn't parse JSON, empty collection received in parser.")
    (null, kvList.toArray)
  }
}
... next I loop over every object, parse it as Avro and send it on to Kafka, but that's beside the point here.
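For completeness, the direction I'm considering for the parse step, a sketch building on the hypothetical foreachJsonObject above: track the max date in the same pass instead of building a JSONArray of the whole payload. (kvList still grows with the row count; fully streaming would mean serializing and sending each record to Kafka inside the loop.)

import java.text.SimpleDateFormat
import java.util.Date
import scala.collection.mutable.ArrayBuffer

private def parseJsonStreaming(in: java.io.InputStream, keyName: String,
    dateField: String, format: SimpleDateFormat): (Date, Array[(String, String)]) = {
  var maxDate: Date = null
  val kvList = new ArrayBuffer[(String, String)]
  foreachJsonObject(in) { node =>
    // parse the date field of each row and keep the running maximum
    val d = format.parse(node.get(dateField).asText)
    if (maxDate == null || d.after(maxDate)) maxDate = d
    kvList.append((node.get(keyName).asText, node.toString))
  }
  (maxDate, kvList.toArray)
}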