
I'm using CombineTextInputFormat to read many small files on Spark.

The Java code is as follows (I've written it as a utility function):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public static JavaRDD<String> combineTextFile(JavaSparkContext sc, String path, long maxSplitSize, boolean recursive)
{
    Configuration conf = new Configuration();
    // Cap the size of each combined split so that many small files are packed into few splits.
    conf.setLong(CombineTextInputFormat.SPLIT_MAXSIZE, maxSplitSize);
    if (recursive)
        conf.setBoolean(CombineTextInputFormat.INPUT_DIR_RECURSIVE, true);
    return
        sc.newAPIHadoopFile(path, CombineTextInputFormat.class, LongWritable.class, Text.class, conf)
        // Keep only the line text; drop the LongWritable byte offsets.
        .map(new Function<Tuple2<LongWritable, Text>, String>()
        {
            @Override
            public String call(Tuple2<LongWritable, Text> tuple) throws Exception
            {
                return tuple._2().toString();
            }
        });
}
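
For context, I call it roughly like this (the master URL, path, and 64 MB split size below are just illustrative values):

JavaSparkContext sc = new JavaSparkContext("local[*]", "combine-test");
// Pack the many small files into splits of at most ~64 MB each (illustrative size).
JavaRDD<String> lines = combineTextFile(sc, "/path/to/input/dir", 64L * 1024 * 1024, false);
System.out.println(lines.count());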

It works, but when the program runs, the following warning is printed:

WARN TaskSetManager: Stage 0 contains a task of very large size (159 KB). The maximum recommended task size is 100 KB.

The program reads about 3.5 MB in total from 1,234 files, all located in one directory.

Is this normal? If not, how can I get rid of this message?

My Spark version is 1.3.

The program runs in local mode.

zeodtr

1 Answer


Independently of your problem, for which I do not have an answer, you might want to try a different way of processing all the files under a directory.

Spark can easily process not just single files but entire directories. Since all your files are located in a single directory, sc.textFile can read every file inside it when you point it at that location, for example:

sc.textFile("//my/folder/with/files");

You can find more information about it in the following question: How to read multiple text files into a single RDD?
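
For comparison with the helper in your question, a minimal Java sketch of the same read via sc.textFile (the directory path is only a placeholder):

// Read every file in the directory as a single RDD of lines (path is illustrative).
JavaRDD<String> lines = sc.textFile("/my/folder/with/files");
// Hadoop-style globs also work if you only want some of the files, e.g.:
JavaRDD<String> onlyTxt = sc.textFile("/my/folder/with/files/*.txt");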

Mikel Urkia
    It's true, but CombineTextInputFormat is way faster in my case. It took 9 secs to finish the job while sc.textFile() took 300 secs. (Yes, three hundred seconds) – zeodtr Mar 31 '15 at 07:23
  • I find that curious. What kind of operations are you performing on the dataset? Could you provide a sample of your app so that I can test it myself? – Mikel Urkia Mar 31 '15 at 07:25
  • Currently it's only a test on my PC. The thing is that there are currently 1,234 test input files, each ranging from 1 KB to 5 KB in size, which is a pretty uncommon case for a Hadoop application. The test input files will be way larger on a real system, so it may not be a problem there, but for this test I need CombineTextInputFormat. – zeodtr Apr 01 '15 at 00:07
  • I just tested a wordcount example with 1,000 files (5 KB each) and indeed, Spark needed 160 seconds to finish the process (1 core). Not 300 seconds, but still much more than 9 seconds. As you say, since you plan to use much heavier datasets on a real system in the future, it might not be an issue at all. I am sorry for not being able to help more here... – Mikel Urkia Apr 01 '15 at 09:46