
I have a simple key-value Map[K, V] myDictionary that is populated by my program, and at the end I want to write it out as a JSON string to a text file, as I will need to parse it back later.

I was using this code earlier,

Some(new PrintWriter(outputDir+"/myDictionary.json")).foreach{p => p.write(compact(render(decompose(myDictionary)))); p.close}

I found it to be slower as the input size increased. Later, I switched to this:

val out = new PrintWriter(outputDir + "/myDictionary.json")
out.println(scala.util.parsing.json.JSONObject(myDictionary.toMap).toString())
out.close()

This is proving to be a bit faster.

I have run this on sample input and found it faster than my earlier approach. I expect my input map to reach at least a million (K, V) entries (a >1 GB text file), so I want to make sure I follow the fastest and most memory-efficient approach for serializing the map. What other approaches would you recommend that I could look into to optimize this?


1 Answer


The JSON support in the standard Scala library is probably not the best choice. Unfortunately, the situation with JSON libraries for Scala is a bit confusing; there are many alternatives (Lift JSON, Play JSON, Spray JSON, Twitter JSON, Argonaut, ...), basically one library for each day of the week. I suggest you have a look at these, at least to see if any of them is easier to use and more performant.


Here is an example using Play JSON, which I have chosen for a particular reason (it can generate formats with macros):

object JsonTest extends App {
  import play.api.libs.json._

  type MyDict = Map[String, Int]

  // A Format[A] bundles Reads[A] (JSON -> A) and Writes[A] (A -> JSON).
  implicit object MyDictFormat extends Format[MyDict] {
    def reads(json: JsValue): JsResult[MyDict] = json match {
      case JsObject(fields) =>
        val b = Map.newBuilder[String, Int]
        fields.foreach {
          case (k, JsNumber(v)) => b += k -> v.toInt
          // non-local return: abort on the first malformed field
          case other => return JsError(s"Not a (string, number) pair: $other")
        }
        JsSuccess(b.result())

      case _ => JsError(s"Not an object: $json")
    }

    def writes(m: MyDict): JsValue = {
      // breakOut avoids an intermediate collection (Scala 2.12 and earlier)
      val fields: Seq[(String, JsValue)] = m.map {
        case (k, v) => k -> JsNumber(v)
      } (collection.breakOut)

      JsObject(fields)
    }
  }

  val m      = Map("hallo" -> 12, "gallo" -> 34)
  val serial = Json.toJson(m)      // picks up the implicit MyDictFormat
  val text   = Json.stringify(serial)
  println(text)
  val back   = Json.fromJson[MyDict](serial)
  assert(back == JsSuccess(m), s"Failed: $back")
}

While you can construct and deconstruct JsValues directly, the main idea is to use a Format[A], where A is the type of your data structure. This puts more emphasis on type safety than the standard Scala library JSON. It looks more verbose, but in the end I think it's the better approach.

There are utility methods Json.toJson and Json.fromJson which look for an implicit format of the type you want.
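By the way, for this particular shape you may not even need the hand-written format: if I recall correctly, Play JSON ships with implicit Reads and Writes for Map[String, V] whenever a format for V is in scope, and the macro I mentioned can derive formats for case classes. A quick standalone sketch (the Entry class is made up for illustration):

import play.api.libs.json._

// built-in map support: no custom Format needed for Map[String, Int]
val m    = Map("hallo" -> 12, "gallo" -> 34)
val text = Json.stringify(Json.toJson(m))                    // {"hallo":12,"gallo":34}
val back = Json.fromJson[Map[String, Int]](Json.parse(text)) // JsSuccess(Map(...))

// macro-derived format for a case class
case class Entry(key: String, value: Int)
implicit val entryFormat: Format[Entry] = Json.format[Entry]
println(Json.toJson(Entry("hallo", 12)))                     // {"key":"hallo","value":12}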

On the other hand, it does construct everything in memory, and it does duplicate your data structure (for each entry in your map you get another (String, JsValue) tuple), so this isn't necessarily the most memory-efficient solution, given that you are operating at GB magnitude...


Jerkson is a Scala wrapper for the Java JSON library Jackson, which apparently supports streaming. I found this project which says it adds streaming support. Play JSON in turn is based on Jerkson, so perhaps you can even figure out how to stream your object with that. See also this question.
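To sketch the streaming idea with Jackson directly: its JsonGenerator emits JSON token by token, so you never build a tree for the whole map in memory. Something like this should work, assuming Jackson 2.x is on the classpath (I haven't benchmarked it against your two versions):

import java.io.{BufferedWriter, FileWriter}
import com.fasterxml.jackson.core.JsonFactory

def writeMapStreaming(m: Map[String, Int], path: String): Unit = {
  val out = new BufferedWriter(new FileWriter(path))
  val gen = new JsonFactory().createGenerator(out)
  try {
    gen.writeStartObject()
    // one field per entry; only the current entry is buffered
    m.foreach { case (k, v) => gen.writeNumberField(k, v) }
    gen.writeEndObject()
  } finally gen.close() // flushes and closes the underlying writer
}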

  • Exactly my problem. Too many choices and I couldn't find a proper article comparing them except this post - http://stackoverflow.com/questions/8054018/json-library-for-scala . Googling didn't help me either. – Learner Oct 18 '13 at 09:34
  • I have added an example for Play JSON. I vaguely remember there is also a possibility with one of these libraries to use streaming I/O, which would probably avoid having to use twice as much memory as your input data structure... – 0__ Oct 18 '13 at 09:43
  • If memory is an issue, perhaps you could just split your big map into several chunks which are encoded as individual JSON objects... – 0__ Oct 18 '13 at 09:55
  • Thanks for the detailed updates. Memory is not an issue. It's just that writing it to a file and reading it back takes a long time. Is there any way to reduce the overhead? – Learner Oct 18 '13 at 09:57
  • No, performance will at best be linear with the size of your map. If it blocks your application, do I/O in a dedicated thread or future... Otherwise you could use a different mechanism, like a Key-Value store (I have good experience with BerkeleyDB JE). There is also [this project](https://github.com/emchristiansen/PersistentMap), not sure how mature it is. – 0__ Oct 18 '13 at 10:00