
I use the following code to load data from HDFS:

spark
  .read
  .option("header", "true")
  .option("mergeSchema", "true")
  .format("parquet")
  .load("hdfs")

When I try to load about 3,000,000 files, I get the following exception:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:201)
    at java.lang.StringBuilder.toString(StringBuilder.java:407)
    at java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:3072)
    at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2867)
    at java.io.ObjectInputStream.readString(ObjectInputStream.java:1639)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1342)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1707)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1707)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
    at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:88)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:72)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)

The file format is .snappy.parquet, each file is about 100 KB, and each file has the following schema:

id, String
type, String
att, String
pre, String
tag, Map[String, String]
day, Int
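
For reference, that corresponds roughly to the following Spark SQL schema (a sketch on my part; field nullability is an assumption, and day/type/att are also the partition columns shown below):

import org.apache.spark.sql.types._

// Rough Spark SQL equivalent of the per-file schema listed above.
// Nullability is left at the default (nullable = true) as an assumption.
val fileSchema = StructType(Seq(
  StructField("id",   StringType),
  StructField("type", StringType),
  StructField("att",  StringType),
  StructField("pre",  StringType),
  StructField("tag",  MapType(StringType, StringType)),
  StructField("day",  IntegerType)
))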

The data was written with the following partitioning:

.repartition($"day", $"type", $"att")
  .write
  .partitionBy("day", "type", "att")
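
Presumably the full write looked roughly like this; the DataFrame name, output path, and compression option are my assumptions, not details taken from the post:

// Hypothetical reconstruction of the write; df, the output path, and the
// compression option are placeholders. partitionBy produces one directory
// per day/type/att combination, each holding small .snappy.parquet files.
import spark.implicits._

df.repartition($"day", $"type", $"att")
  .write
  .partitionBy("day", "type", "att")
  .option("compression", "snappy")
  .parquet("hdfs:///path/to/output")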

When I tried with about 107,000 files, it worked fine.

At this step Spark should only be loading the files' metadata, so why does it need so much memory? Is there a limit on how many files can be loaded from HDFS?
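
The stack trace shows the driver running out of heap while deserializing task results (TaskResultGetter), which fits the mergeSchema = true path: Spark reads every file's footer in tasks and collects the per-file schemas back to the driver to merge them. A minimal sketch of the knobs that look relevant; the concrete values are guesses, not settings tested against this workload:

import org.apache.spark.sql.SparkSession

// Sketch only; the values are guesses, not tested against this workload.
// spark.driver.memory has to be set before the driver JVM starts
// (e.g. spark-submit --driver-memory 8g), so it is only mentioned here.
val spark = SparkSession.builder()
  .config("spark.driver.maxResultSize", "4g")        // cap on task results collected to the driver
  .config("spark.sql.parquet.mergeSchema", "false")  // the default; avoids merging every footer
  .getOrCreate()

// If every file really has the same schema, supplying it up front
// (or simply not requesting mergeSchema) skips the per-file schema pass.
val df = spark.read
  .schema(fileSchema)   // fileSchema as sketched earlier
  .parquet("hdfs")      // placeholder path from the original snippet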

xkrogen
  • Can you try running with KryoSerializer? It has a smaller memory footprint than the Java serializer. – Constantine Feb 13 '19 at 04:19
  • Given this post https://stackoverflow.com/questions/29011574/how-does-spark-partitioning-work-on-files-in-hdfs, I get the impression that many millions of partitions will be created. That strikes me as a huge overhead. – thebluephantom Feb 13 '19 at 11:23
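
Following the KryoSerializer suggestion in the first comment, switching the serializer is a one-line config change; whether it actually avoids the OOM on this footer-collection path is untested:

import org.apache.spark.sql.SparkSession

// Switch spark.serializer from the default Java serialization to Kryo,
// as suggested in the comment above; custom classes can optionally be
// registered via spark.kryo.classesToRegister to shrink payloads further.
val spark = SparkSession.builder()
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()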

0 Answers