How to tune a Spark application with a custom Hadoop input format

Question

My Spark application processes files (average size 20 MB) with a custom Hadoop input format and stores the results in HDFS.

Here is the code snippet:

Configuration conf = new Configuration();

// One (Text, Text) record stream per input split produced by CustomInputFormat
JavaPairRDD<Text, Text> baseRDD = ctx
    .newAPIHadoopFile(input, CustomInputFormat.class, Text.class, Text.class, conf);

JavaRDD<myClass> mapPartitionsRDD = baseRDD
    .mapPartitions(new FlatMapFunction<Iterator<Tuple2<Text, Text>>, myClass>() {
        // my logic goes here
    });

// a few more transformations, ending in `result`
result.saveAsTextFile(path);

This application creates one task (partition) per input file, processes it, and stores the corresponding part file in HDFS.

That is, for 10,000 input files, 10,000 tasks are created and 10,000 part files are stored in HDFS.

Both mapPartitions and map operations on baseRDD create one task per file.

The SO question "How to set the number of partitions for newAPIHadoopFile?" suggests setting conf.setInt("mapred.max.split.size", 4); to configure the number of partitions.

But when this parameter is set, CPU utilization hits the maximum and no stage starts, even after a long time.

If I don't set this parameter, the application completes successfully as described above.

How can I set the number of partitions with newAPIHadoopFile and make the job more efficient?

What happens with the mapred.max.split.size option?

============

Update: What happens with the mapred.max.split.size option?

In my use case the files are small, so changing the split size options is irrelevant here.

More info in this SO question: Behavior of the parameter "mapred.min.split.size" in HDFS

mapred.max.split.size specifies the size in bytes, I think – yjshen, May 04 '15 at 16:57
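
To put the comment above in concrete terms, here is a rough sketch of the arithmetic. It assumes that CustomInputFormat extends Hadoop's FileInputFormat without overriding its split logic, that the files are splittable, and that the HDFS block size is 128 MB; none of that is stated in the question. The class SplitEstimate and its method names are made up for illustration. As far as I know, FileInputFormat picks the split size as max(minSize, min(maxSize, blockSize)), so a max split size of 4 means 4-byte splits:

```java
class SplitEstimate {
    // Split-size rule as used by Hadoop's FileInputFormat (to my understanding):
    // max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of splits for one file. This is a ceiling division;
    // it ignores the small slack Hadoop allows on the last split.
    static long estimateSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long fileLength = 20L * 1024 * 1024;   // 20 MB, the average size in the question
        long blockSize = 128L * 1024 * 1024;   // ASSUMPTION: 128 MB HDFS block size

        // With conf.setInt("mapred.max.split.size", 4)
        long tiny = computeSplitSize(blockSize, 1L, 4L);
        System.out.println(tiny);                               // prints 4
        System.out.println(estimateSplits(fileLength, tiny));   // prints 5242880

        // Without the setting (max split size effectively unlimited)
        long dflt = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println(estimateSplits(fileLength, dflt));   // prints 1
    }
}
```

Under those assumptions each 20 MB file would be cut into roughly 5 million splits, which would explain the driver burning CPU before any stage starts. The same rule also gives one split per file when the parameter is left unset, which matches the one-task-per-file behavior described in the question.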

score 0 · Accepted Answer · answered May 04 '15 at 23:14

Just use baseRDD.repartition(<a sane amount>).mapPartitions(...). That runs the downstream operations on fewer partitions, which helps most when your files are small.
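
One possible way to pick "a sane amount" is to aim for partitions of roughly one HDFS block each. The sketch below is only a heuristic under my own assumptions: the 128 MB target and the helper name targetPartitions are not from the answer, and the right number also depends on cluster size and on how heavy the per-record logic is.

```java
class PartitionCount {
    // Number of partitions needed so that each holds about targetBytes of input.
    // Always returns at least 1 and never more than Integer.MAX_VALUE.
    static int targetPartitions(long totalBytes, long targetBytes) {
        long n = (totalBytes + targetBytes - 1) / targetBytes;  // ceiling division
        return (int) Math.max(1L, Math.min(n, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long fileCount = 10_000L;                 // from the question
        long avgFileBytes = 20L * 1024 * 1024;    // 20 MB average, from the question
        long targetBytes = 128L * 1024 * 1024;    // ASSUMPTION: about one 128 MB block per partition

        int n = targetPartitions(fileCount * avgFileBytes, targetBytes);
        System.out.println(n);                    // prints 1563
    }
}
```

The resulting value would then be the argument in baseRDD.repartition(n), so the 10,000 input files end up as about 1,563 tasks and part files instead of 10,000.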

Daniel Langdon
