
I'm getting the exception below when an individual record's size is more than 3 GB:

java.lang.IllegalArgumentException
App > at java.nio.CharBuffer.allocate(CharBuffer.java:330)
App > at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
App > at org.apache.hadoop.io.Text.decode(Text.java:412)
App > at org.apache.hadoop.io.Text.decode(Text.java:389)
App > at org.apache.hadoop.io.Text.toString(Text.java:280)
App > at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$createBaseRdd$1.apply(JsonFileFormat.scala:135)
App > at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$createBaseRdd$1.apply(JsonFileFormat.scala:135)

How can I increase the buffer size for a single record?

  • Might not be helpful, but worth noting: even if this is possible to do (not sure), it might not be the right approach (will be slow and risky). Can you avoid such a huge record size? How did it come to be? If it's the result of a `RDD.groupByKey`, for example, you'd probably want to replace it with `reduceByKey` or some other aggregation. – Tzach Zohar Nov 08 '17 at 21:22
  • It is a json file that has all the records as json array under one key. I'm trying to flatten it. But I'm not able to perform any operation on it. Not even to print the schema of the json array. – DINESHKUMAR MURUGAN Nov 08 '17 at 21:32
  • If you can afford to alter the JSON file's structure, can we not split that "single large array" into "an array of arrays" using some utility program before you process it further? – Marco99 Nov 09 '17 at 08:42

1 Answer


You probably have one huge line in your file containing the whole array. You get this exception because the code tries to build a CharBuffer that is too big (most likely with a length that overflowed an int and became negative). The maximum array/string size in Java is 2^31 - 1 (Integer.MAX_VALUE) (see this thread). You say you have a 3 GB record; at 1 byte per char, that makes about 3 billion characters, which is more than 2^31 (roughly 2 billion).
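
To make the arithmetic concrete, here is a minimal sketch of what happens when a ~3 GB length is forced into an Int:

// Minimal sketch: a ~3 GB byte count does not fit in a 32-bit Int, so it wraps
// around to a negative value, and CharBuffer.allocate rejects negative capacities
// with an IllegalArgumentException.
val recordBytes: Long = 3L * 1024 * 1024 * 1024   // ~3 GB
val asInt: Int = recordBytes.toInt                // wraps to a negative number
println(s"$recordBytes bytes as Int = $asInt")    // prints a negative value
// java.nio.CharBuffer.allocate(asInt)            // would throw IllegalArgumentException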

What you could do is a bit hacky, but since you only have one key with a big array, it may work. Your JSON file probably looks like this:

{
  "key" : ["v0", "v1", "v2"... ]
}

or like this (but I think in your case it is the former):

{
  "key" : [
      "v0", 
      "v1", 
      "v2",
      ... 
   ]
}

Thus you could try changing the record delimiter used by Hadoop to "," as described here. Basically, it is done like this:

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Read the file with "," as the record delimiter instead of the newline,
// so each array element becomes its own record (sc is the SparkContext).
def nlFile(path: String) = {
    val conf = new Configuration
    conf.set("textinputformat.record.delimiter", ",")
    sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
          .map(_._2.toString) // keep the record text, drop the byte-offset key
}

Then you could read your array; you would just have to remove the JSON brackets yourself with something like this:

nlFile("...")
  .map(_.replaceAll("^.*\\[", "").replaceAll("\\].*$",""))

Note that you would have to be more careful if your records can contain the characters "[" and "]", but that is the idea.
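
As a quick follow-up sketch, assuming the array elements are simple quoted strings as in the example above (and contain no commas themselves), you could then trim the whitespace and quotes and turn the result into a Dataset for further processing; the path below is a placeholder:

// Hypothetical continuation: clean up each extracted element.
// Assumes simple quoted string values ("v0", "v1", ...) with no commas inside them,
// and that `spark` is the usual SparkSession (as in spark-shell).
import spark.implicits._   // for .toDS() on an RDD

val values = nlFile("/path/to/big.json")                      // placeholder path
  .map(_.replaceAll("^.*\\[", "").replaceAll("\\].*$", ""))   // drop the surrounding {"key": [ ... ]}
  .map(_.trim.stripPrefix("\"").stripSuffix("\""))            // strip whitespace and quotes

val ds = values.toDS()
ds.show(3)   // e.g. v0, v1, v2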

Oli