
Situation:

I'm writing a Kafka producer that fetches JSON data (in large, multi-megabyte chunks) from a web request.

I need to check this data for a date field and grab the largest one.

Next I chop the JSON array up into smaller JSON objects ("rows of data") and serialize them as Avro (generic) records.

While my application works, it's using quite a lot of memory for something rather lightweight. I suspect the JSON parsing is the culprit.

Or rather, I'm the one that's not writing proper code.

Question:

How can I lower my memory footprint? (It can spike over 1 GB until GC comes and saves the day.) I was thinking of "finding" every JSON object and doing an operation per object, instead of reading in the whole thing. However, I'm not inclined to write a whole codebase for this; it just needs to handle JSON objects, and it has to work generically. Having my own custom code just to find JSON objects would be too error prone whenever edge cases arise.

Code:

def get(url: String, headers: List[String]): String = {
  val httpEntity = try {
    getRequest(url, headers)
  } catch {
    ....
  }

  if (httpEntity == null) return ""

  val inputStream = httpEntity.getContent
  var content = ""
  try {
    // reads the whole response body into a single String
    content = scala.io.Source.fromInputStream(inputStream, Codec.UTF8.name).getLines.mkString
  } catch {
    case e: Exception =>
      logger.error("can't fetch/parse data from http stream.")
      inputStream.close()
      throw e
  }
  inputStream.close()
  if (content == null) {
    throw new RuntimeException("...")
  }
  //logger.debug(content)
  content
}

This is called here:

val stringData = someclass.get(url, headers)
if (!stringData.trim.equals("[]")) parseJson(stringData, "some key", "date found in records", new SimpleDateFormat("some yyyy/dd stuff here"))

The parsing code:

private def parseJson(string: String, keyName: String, dateField: String, format: SimpleDateFormat): (Date, Array[(String, String)]) = {
    val arr = new JSONArray(string)
    val kvList = new ArrayBuffer[(String, String)]
    logger.debug(s"${arr.length} records found, will loop over json objects")
    if (arr.length() > 0) {
      logger.info(s"parsing ${arr.length} records")
      for (i <- 0 until arr.length) {
        val obj = arr.getJSONObject(i)
        kvList.append((obj.getString(keyName), obj.toString))
      }
      //this is where I go and get the datefield I wanted
      (extractJsonDate.getMaxDate(arr, dateField, format), kvList.toArray)
    } else {
      logger.info("didn't parse JSON, empty collection received in parser.")
      (null, kvList.toArray)
    }
  }

... next I loop over every object, parse it as Avro & send it on to Kafka, but that's beside the point here.
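For reference, a rough sketch of what that step can look like; the schema, topic name, and producer wiring below are simplified placeholders rather than my actual code:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Placeholder schema: one string key plus the raw row payload.
val schema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"Row","fields":[
    |  {"name":"key","type":"string"},
    |  {"name":"payload","type":"string"}
    |]}""".stripMargin)

def sendRows(producer: KafkaProducer[String, GenericRecord], rows: Array[(String, String)]): Unit =
  rows.foreach { case (key, json) =>
    val record = new GenericData.Record(schema)
    record.put("key", key)
    record.put("payload", json)
    // the producer is assumed to be configured with an Avro-capable value serializer
    producer.send(new ProducerRecord("some-topic", key, record))
  }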


1 Answer


There are a few things that I think can help you here.

  1. Turn on string deduplication in your JVM garbage collector:

    -Xmx20M -XX:+UseG1GC -XX:+UseStringDeduplication

  2. Find a lightweight, streaming JSON parser that might be better suited to your needs. A bit of Googling will help you find exactly what you need (a streaming sketch follows this list).

  3. When you're downloading megabyte-sized chunks, instead of storing them all in memory, consider inserting them into a database table. You'll incur some slowdown, but you won't put as much stress on memory.
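As an illustration of point 2, here is a minimal sketch using Jackson's streaming API so the array is walked one object at a time instead of materializing the whole payload. `inputStream` would be the HTTP entity's content stream and `handleRow` stands in for the per-record work (key/date extraction, Avro conversion, Kafka send); both names are assumptions, not code from the question:

import java.io.InputStream
import com.fasterxml.jackson.core.JsonToken
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

def streamRows(inputStream: InputStream)(handleRow: JsonNode => Unit): Unit = {
  val mapper = new ObjectMapper()
  // parse straight from the stream instead of building one giant String first
  val parser = mapper.getFactory.createParser(inputStream)
  try {
    if (parser.nextToken() == JsonToken.START_ARRAY) {
      while (parser.nextToken() == JsonToken.START_OBJECT) {
        // readTree consumes exactly one object, so only one row is in memory at a time
        val node: JsonNode = mapper.readTree(parser)
        handleRow(node)
      }
    }
  } finally {
    parser.close() // AUTO_CLOSE_SOURCE is on by default, so this also closes inputStream
  }
}

This keeps the peak footprint close to the size of a single row plus parser buffers, rather than the whole multi-megabyte response.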