I have a large number of tab-delimited files with over 40 columns each. I want to apply aggregations on them, selecting only a few columns. Since the files are stored in Hadoop, I think Apache Spark is a good candidate for this. I have the following program:
public class MyPOJO {
    int field1;
    String field2;
    // ... remaining fields omitted
}
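The class also has a constructor matching the call in the map function below; simplified, it looks roughly like this (just a sketch, with most fields omitted and the parsing detail assumed):

public MyPOJO(String field1, String field2, String field3, String field4) {
    this.field1 = Integer.parseInt(field1); // field1 is numeric in my data (assumption)
    this.field2 = field2;
    // ... assign field3, field4 and the remaining fields
}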
JavaSparkContext sc;
JavaRDD<String> data = sc.textFile("path/input.csv");
JavaSQLContext sqlContext = new JavaSQLContext(sc);

// Parse each line into a MyPOJO
JavaRDD<MyPOJO> rdd_records = data.map(
    new Function<String, MyPOJO>() {
        public MyPOJO call(String line) throws Exception {
            String[] fields = line.split(",");
            MyPOJO sd = new MyPOJO(fields[0], fields[1], fields[2], fields[3]);
            return sd;
        }
    });
The code above runs fine when I apply the action rdd_records.saveAsTextFile("/to/hadoop/"); I can see that it creates a part-00000 file with the RDD output. But when I try the following:
JavaSchemaRDD table = sqlContext.applySchema(rdd_records, MyPOJO.class);
table.printSchema();                      // prints just "root" and empty lines
table.saveAsTextFile("/to/hadoop/path");  // writes a part file with [] on each line
I don't know where the problem is. MyPOJO.class has all the fields, so why is the JavaSchemaRDD empty, and why does it print nothing into the part file? I am new to Spark.
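For context, once the schema works, the kind of aggregation I want to run over the few selected columns is roughly the following (only a sketch, assuming Spark 1.1+ where registerTempTable is available; the table name, column names, and output path are placeholders):

// Register the JavaSchemaRDD as a temporary table and aggregate two of the columns.
// "records", "field1", "field2" and the output path are placeholder names.
table.registerTempTable("records");
JavaSchemaRDD aggregated = sqlContext.sql(
    "SELECT field2, SUM(field1) AS total FROM records GROUP BY field2");
aggregated.saveAsTextFile("/to/hadoop/aggregated");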