I am new to Spark, so I am trying out a small program and have run into the error below. Could someone help me with this?
FYI - the program seems to work when none of the columns in the sample file are empty; the issue seems to be caused by the empty (null) value in the second row.
Data: Contents of TEMP_EMP.dat
1232|JOHN|30|IT
1532|DAVE|50|
1542|JEN|25|QA
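What I checked in the spark-shell (I may be misreading this, but split with a character separator seems to drop trailing empty strings, so the second row only produces 3 fields):

"1232|JOHN|30|IT".split('|')   // Array(1232, JOHN, 30, IT) -> 4 fields
"1532|DAVE|50|".split('|')     // Array(1532, DAVE, 50)     -> only 3 fields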
Scala code to parse this data into a DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StringType, StructType}

val employee = sc.textFile("file:///TEMP_EMP.dat")
val schemaString = "ID|NAME|AGE|DEPT"
// One StructField per column name, all typed as nullable strings
val schema = StructType(schemaString.split('|').map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = employee.map(_.split('|')).map(e => Row(e(0), e(1), e(2), e(3)))
val employeeDF = sqlContext.createDataFrame(rowRDD, schema)
employeeDF.registerTempTable("employee")
val allrecords = sqlContext.sql("SELECT * FROM employee")
allrecords.show()
Error Log:
WARN 2016-08-17 13:36:21,006 org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 6.0 : java.lang.ArrayIndexOutOfBoundsException: 3
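From what I have read, Java's String.split(regex, limit) keeps trailing empty strings when the limit is negative, so I was thinking of rewriting the mapping step like this (untested beyond the shell; the pipe has to be escaped as \\| because this overload treats the separator as a regex):

// Possible fix: limit = -1 keeps trailing empty strings,
// so every row yields exactly 4 fields
val rowRDD = employee
  .map(_.split("\\|", -1))
  .map(e => Row(e(0), e(1), e(2), e(3)))

Would this be the right approach, or is there a more idiomatic way to handle empty columns?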