I'm using CombineTextInputFormat to read many small files on Spark.
The Java code is as follows (I've written it as a utility function):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public static JavaRDD<String> combineTextFile(JavaSparkContext sc, String path, long maxSplitSize, boolean recursive)
{
    // Pack many small files into combined splits of at most maxSplitSize bytes.
    Configuration conf = new Configuration();
    conf.setLong(CombineTextInputFormat.SPLIT_MAXSIZE, maxSplitSize);
    if (recursive)
        conf.setBoolean(CombineTextInputFormat.INPUT_DIR_RECURSIVE, true);

    // Read (byte offset, line) pairs and keep only the line text.
    return sc.newAPIHadoopFile(path, CombineTextInputFormat.class, LongWritable.class, Text.class, conf)
        .map(new Function<Tuple2<LongWritable, Text>, String>()
        {
            @Override
            public String call(Tuple2<LongWritable, Text> tuple) throws Exception
            {
                return tuple._2().toString();
            }
        });
}
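For context, this is roughly how I call it (the path and the 64 MB split size here are just illustrative values, not my actual ones):

    JavaRDD<String> lines = combineTextFile(sc, "/data/small-files", 64L * 1024 * 1024, false);
    System.out.println(lines.count());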
It works, but when the program runs, the following warning is printed:
WARN TaskSetManager: Stage 0 contains a task of very large size (159 KB). The maximum recommended task size is 100 KB.
The program reads about 3.5 MB in total, spread over 1234 files (so roughly 3 KB per file on average). All of the files are in a single directory.
Is this normal? If not, how can I get rid of this warning?
My Spark version is 1.3.
The program runs in local mode.
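In case it matters, the context is created more or less like this (the app name and master string are just a sketch of my test setup):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf sparkConf = new SparkConf().setAppName("CombineTextFileTest").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);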