I have a large number of tab-delimited files with over 40 columns each. I want to apply aggregations on them, selecting only a few columns. Since the files are stored in Hadoop, I think Apache Spark is a good candidate for this. I have the following program:
public class MyPOJO {
    int field1;
    String field2;
    // ... remaining fields omitted
}
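The class also has a constructor matching the call in the map function below; simplified, it looks roughly like this (just a sketch, with most fields omitted and the parsing detail assumed):

public MyPOJO(String field1, String field2, String field3, String field4) {
    this.field1 = Integer.parseInt(field1); // field1 is numeric in my data (assumption)
    this.field2 = field2;
    // ... assign field3, field4 and the remaining fields
}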
JavaSparkContext sc;
JavaRDD<String> data = sc.textFile("path/input.csv");
JavaSQLContext sqlContext = new JavaSQLContext(sc);

// Parse each line into a MyPOJO
JavaRDD<MyPOJO> rdd_records = data.map(
    new Function<String, MyPOJO>() {
        public MyPOJO call(String line) throws Exception {
            String[] fields = line.split(",");
            MyPOJO sd = new MyPOJO(fields[0], fields[1], fields[2], fields[3]);
            return sd;
        }
    });
The code above runs fine when I apply the action rdd_records.saveAsTextFile("/to/hadoop/"); I can see that it creates a part-00000 file with the RDD output. But when I try the following:
JavaSchemaRDD table = sqlContext.applySchema(rdd_records, MyPOJO.class);
table.printSchema();                      // prints just "root" and empty lines
table.saveAsTextFile("/to/hadoop/path");  // writes a part file with [] on each line
I don't know where the problem is. MyPOJO.class has all the fields, so why is the JavaSchemaRDD empty, and why does it print nothing into the part file? I am new to Spark.
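For context, once the schema works, the kind of aggregation I want to run over the few selected columns is roughly the following (only a sketch, assuming Spark 1.1+ where registerTempTable is available; the table name, column names, and output path are placeholders):

// Register the JavaSchemaRDD as a temporary table and aggregate two of the columns.
// "records", "field1", "field2" and the output path are placeholder names.
table.registerTempTable("records");
JavaSchemaRDD aggregated = sqlContext.sql(
    "SELECT field2, SUM(field1) AS total FROM records GROUP BY field2");
aggregated.saveAsTextFile("/to/hadoop/aggregated");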