I wrote a simple Apache Spark (1.2.0) Java program that reads a text file and then writes it back to disk using saveAsTextFile. However, the output folder either has no content (just the _SUCCESS file) or, at times, has incomplete data (data from just half of the tasks).
When I call rdd.count() on the RDD, it returns the correct number, so I know the RDD is constructed correctly; it is only the saveAsTextFile call that is not working.
Here is the code:
/* SimpleApp.java */
import java.util.List;
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "/tmp/READ_ME.txt"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile);
        logData.saveAsTextFile("/tmp/simple-output");
        System.out.println("Lines -> " + logData.count());
    }
}
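
To see what actually landed in the output folder, a small standalone check like the following could be run after the job. It is only a sketch: the class name OutputCheck and its two helper methods are made up for illustration, and it assumes the output sits on the local filesystem, is UTF-8 text, and uses the usual part-NNNNN file names that saveAsTextFile writes (one per partition).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/* OutputCheck.java (hypothetical helper, not part of Spark) */
public class OutputCheck {

    // Names of the part files in the output directory, sorted.
    // Hidden checksum files such as ".part-00000.crc" and the _SUCCESS
    // marker do not start with "part-", so they are skipped.
    static List<String> listPartFiles(Path outputDir) throws IOException {
        try (Stream<Path> entries = Files.list(outputDir)) {
            return entries
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.startsWith("part-"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Total number of lines across all part files (assumes UTF-8 text).
    static long countLines(Path outputDir) throws IOException {
        long total = 0;
        for (String name : listPartFiles(outputDir)) {
            try (Stream<String> lines = Files.lines(outputDir.resolve(name))) {
                total += lines.count();
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the output path used in SimpleApp above.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/tmp/simple-output");
        if (!Files.isDirectory(dir)) {
            System.out.println("No output directory at " + dir);
            return;
        }
        System.out.println("Part files -> " + listPartFiles(dir));
        System.out.println("Lines      -> " + countLines(dir));
    }
}
```

The line total can then be compared with the value printed by logData.count(). If the job runs on more than one machine, it may be worth running the same check on each of them, since a path like /tmp/simple-output refers to a different disk on every machine; whether that explains the missing data here is an assumption, not something confirmed.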