4

I wrote a simple Apache Spark (1.2.0) Java program to import a text file and then write it to disk using saveAsTextFile. But the output folder either has no content (just the _SUCCESS file) or at times has incomplete data (data from only half of the tasks).

When I do rdd.count() on the RDD, it shows the correct number, so I know the RDD was constructed correctly; it is just the saveAsTextFile method that is not working.

Here is the code:

/* SimpleApp.java */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "/tmp/READ_ME.txt"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile);

    logData.saveAsTextFile("/tmp/simple-output");
    System.out.println("Lines -> " + logData.count());
  }
}
maxpayne
This is possibly a duplicate of [how to make saveAsTextFile NOT split output into multiple file](http://stackoverflow.com/questions/24371259/how-to-make-saveastextfile-not-split-output-into-multiple-file). That question has a few answers describing ways to write the output to one local file (a minimal sketch follows below). – Tobber Feb 17 '15 at 10:13
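
A minimal sketch of the approach from that linked question, using the logData RDD from the code above (the output path is illustrative): coalescing to a single partition before saving makes Spark write one part file instead of one per task.

    // Coalesce to one partition so saveAsTextFile produces a single part file.
    // Note: the single task still writes to whichever node it runs on.
    logData.coalesce(1).saveAsTextFile("/tmp/simple-output-single");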

2 Answers

5

This is because you're saving to a local path. Are you running on multiple machines? If so, each worker is saving to its own local /tmp directory. Sometimes the driver executes a task, so you get part of the result locally. Really, you don't want to mix distributed mode and local file systems.
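
For example, a minimal sketch of writing to HDFS instead, using the logData RDD from the question (the namenode host and port below are placeholders for your cluster's actual values):

    // Assumption: an HDFS cluster whose namenode is reachable at namenode:8020.
    // Every worker writes its part files into the same distributed directory.
    logData.saveAsTextFile("hdfs://namenode:8020/tmp/simple-output");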

Sean Owen
  • Hi Sean, thanks for your answer. Yes, this is an 8-node standalone cluster. I actually checked the local directories on each of the worker machines; all they have are folders with the same output directory name, but those contain only a _temporary folder with some worker files. It makes a lot of sense not to use the local file system in distributed mode, though. I will give it a try with HDFS. Thanks again. – maxpayne Feb 15 '15 at 01:58
  • 1
You could also use an NFS mount (e.g. /data) that is visible across all nodes to read/write files; probably simpler than setting up HDFS (see the sketch below). – Sujee Maniyam Feb 15 '15 at 06:12
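
A minimal sketch of that NFS approach, again using the logData RDD from the question (/data is an assumed mount point, visible at the same path on every node):

    // Assumption: /data is an NFS mount shared by all nodes in the cluster.
    // The file:// scheme addresses the local filesystem rather than HDFS.
    logData.saveAsTextFile("file:///data/simple-output");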
0

You can try code like the following (for example):

int numPartitions = 4; // choose a partition count for the input
JavaSparkContext sc = new JavaSparkContext("local", "Application Name"); // or a cluster master URL such as spark://host:7077
JavaRDD<String> lines = sc.textFile("Path Of Your File", numPartitions);
System.out.println("Lines -> " + lines.count());

This prints the number of lines contained in the file.

Bhaumik Thakkar