
I'd like to parse and process very large JSON files (about 400 MB each) in Scala.

My use-case is batch processing. I can receive several very large files (up to 20 GB, which are then split before processing) at the same time, and I want to process them quickly, as a queue (but that's not the subject of this post!). So this is really about distributed architecture and performance.

My JSON file format is an array of objects, where each JSON object contains at least 20 fields. My flow consists of two major steps: first, mapping each JSON object to a Scala object; second, applying some transformations to the Scala object's data.
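
For illustration, the target of the mapping step would be a case class along these lines (the field names here are invented; the real objects have at least 20 fields):

```scala
// Hypothetical target of the mapping step; the real object has 20+ fields.
case class Event(id: Long, userId: String, timestamp: Long, amount: Double)
```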

To avoid loading the whole file into memory, I'd like a parsing library that supports incremental parsing. There are so many libraries (Play-JSON, Jerkson, Lift-JSON, the built-in scala.util.parsing.json.JSON, Gson) and I cannot figure out which one to choose, given my requirement to minimize dependencies.

  • Do you have any recommendations for a library I can use for high-volume parsing with good performance?

Also, I'm searching for a way to process the mapping of the JSON file and the transformations made on the fields in parallel (across several nodes).

  • Do you think I can use Apache Spark to do this? Or are there alternative ways to accelerate/distribute the mapping/transformation? (A rough sketch of what I have in mind follows below.)
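
To make this concrete, here is roughly what I imagine the Spark variant would look like. This is a sketch under assumptions: the 400 MB array would first be rewritten as one JSON object per line (JSON Lines) so Spark can split the file, and the paths, master URL, and Event fields are placeholders:

```scala
// Rough sketch only: assumes the input has been rewritten as one JSON
// object per line (JSON Lines), which is what lets Spark split the file
// across nodes. Paths and master URL are placeholders.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.SparkContext

// Same hypothetical record as above.
case class Event(id: Long, userId: String, timestamp: Long, amount: Double)

object SparkJob extends App {
  // "local[4]" is a placeholder; point this at your cluster master.
  val sc = new SparkContext("local[4]", "json-batch")

  val events = sc.textFile("hdfs:///data/big.jsonl")
    .mapPartitions { lines =>
      // One ObjectMapper per partition: it is not serializable, so it
      // must be created on the worker, not in the driver.
      val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
      lines.map(line => mapper.readValue(line, classOf[Event]))
    }

  // Placeholder for the real transformation step.
  val transformed = events.map(e => e.copy(amount = e.amount * 2))
  transformed.saveAsTextFile("hdfs:///data/out")
  sc.stop()
}
```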

Thanks for any help.

Best regards, Thomas

Nypias
  • Perhaps [using tools like Spark is a huge overkill](http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html). – om-nom-nom Oct 13 '13 at 23:11
  • See also http://stackoverflow.com/questions/8898353/parsing-a-large-30mb-json-file-with-net-liftweb-json-or-scala-util-parsing-jso – om-nom-nom Oct 13 '13 at 23:17
  • It's probably worth pointing out that `scala.util.parsing.json.JSON` is being deprecated; I would guess largely because it was originally implemented as a demo for parser combinators. – J Cracknell Oct 14 '13 at 00:07
  • Thanks @om-nom-nom! I added some details about my use-case. I have already read the article "Don't use Hadoop", and maybe Apache Spark is not the solution to my problem. I also want something that can scale if the load increases. That's why I thought of Spark. What about Akka with several workers working on each file partition? – Nypias Oct 14 '13 at 00:24

1 Answer


Considering a scenario without Spark, I would advise streaming the JSON with Jackson Streaming (Java) (see for example there), mapping each JSON object to a Scala case class, and sending the instances to an Akka router with several routees that do the transformation part in parallel.
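
A minimal sketch of that pipeline, assuming Jackson 2.x (jackson-core/jackson-databind) and classic Akka actors; `Record`, its fields, and the pool size of 8 are illustrative placeholders:

```scala
import java.io.File

import com.fasterxml.jackson.core.{JsonFactory, JsonToken}
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Stand-in for the real ~20-field object.
case class Record(id: Long, name: String, value: Double)

// Routee: performs the transformation step for one record at a time.
class TransformWorker extends Actor {
  def receive = {
    case r: Record =>
      // ... real transformation logic here; placeholder just logs the id.
      println(s"processed record ${r.id}")
  }
}

object StreamingJob extends App {
  val system = ActorSystem("json-batch")
  // Round-robin pool of 8 workers; tune the size to your hardware.
  val router = system.actorOf(RoundRobinPool(8).props(Props[TransformWorker]), "workers")

  val mapper = new ObjectMapper()
  val parser = new JsonFactory().createParser(new File("big.json"))

  // The document is one top-level array: [ {...}, {...}, ... ]
  require(parser.nextToken() == JsonToken.START_ARRAY, "expected a JSON array")

  // Each iteration pulls exactly one object off the stream, so memory
  // stays bounded by a single record, not the whole 400 MB file.
  while (parser.nextToken() == JsonToken.START_OBJECT) {
    val node: JsonNode = mapper.readTree(parser)
    router ! Record(
      node.get("id").asLong(),
      node.get("name").asText(),
      node.get("value").asDouble()
    )
  }
  parser.close()
  // (In a real job you would await completion and shut the system down.)
}
```

Mapping each object to a lightweight case class before sending it to the router keeps the actor messages small and immutable, which is what makes the parallel transformation step safe.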

atamborrino