Dataframes from pipe delimited file in SPARK

Question

I am new to Spark. I wrote a small program and ran into the error below. Could someone help with this?

FYI: the program works when no column in the sample file is empty, so the issue seems to be caused by the empty value at the end of the second row.

Data: Contents of TEMP_EMP.dat

1232|JOHN|30|IT
1532|DAVE|50|
1542|JEN|25|QA

SCALA code to parse this data into dataframes

import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType, StructField, StringType};
val employee = sc.textFile("file:///TEMP_EMP.dat")
val schemaString = "ID|NAME|AGE|DEPT";
val schema = StructType(schemaString.split('|').map(fieldName=>StructField(fieldName,StringType,true)));
val rowRDD = employee.map(_.split('|')).map(e => Row(e(0),e(1),e(2), e(3) ));
val employeeDF = sqlContext.createDataFrame(rowRDD, schema);
employeeDF.registerTempTable("employee");
val allrecords = sqlContext.sql("SELECT * FROM employee");
allrecords.show();

Error Log:

WARN 2016-08-17 13:36:21,006 org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 6.0 : java.lang.ArrayIndexOutOfBoundsException: 3

score 0 · Answer 1 · answered Aug 17 '16 at 20:28

This line:

val rowRDD = employee.map(_.split('|')).map(e => Row(e(0),e(1),e(2), e(3) ));

You assume that every array produced by employee.map(_.split('|')) has at least four elements, but the array for the second row has only three, because split drops trailing empty strings. Accessing e(3) therefore throws an index out of bounds exception.

To illustrate:

scala> val oneRow = "1532|DAVE|50|".split('|')
oneRow: Array[String] = Array(1532, DAVE, 50)

scala> oneRow(3)
java.lang.ArrayIndexOutOfBoundsException: 3

answered Aug 17 '16 at 20:28

Alfredo Gimenez

OK, that makes sense. But this is a pretty common scenario: any column in these text files can be empty, and we should be able to handle that in code by setting the value to null. Any idea how to handle this in this code? – baburam1985 Aug 18 '16 at 01:19
See here: http://stackoverflow.com/questions/16231254/how-to-get-an-option-from-index-in-collection-in-scala , this also lets you use `Option` instead of `null`, which is generally better (prevents null pointer exceptions). – Alfredo Gimenez Aug 18 '16 at 16:47
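
A minimal sketch of the `Option` approach suggested in the comment above, in plain Scala without Spark. The helper name `fieldAt` and the choice to treat an empty string as missing are assumptions made for illustration; they are not part of the original program.

```scala
// Hypothetical helper: returns None when the index is out of range
// or when the field is present but empty.
def fieldAt(fields: Array[String], i: Int): Option[String] =
  fields.lift(i).filter(_.nonEmpty)

// The problematic second row from TEMP_EMP.dat.
val fields = "1532|DAVE|50|".split('|')

val name = fieldAt(fields, 1) // Some("DAVE")
val dept = fieldAt(fields, 3) // None, instead of an exception

// Convert back to nullable values for all four columns.
val values: Seq[String] = (0 until 4).map(i => fieldAt(fields, i).orNull)
```

In the Spark program, `values` could then be turned into a row with `Row.fromSeq(values)`, so the missing DEPT becomes a null in the DataFrame.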

score 0 · Accepted Answer · edited Dec 27 '16 at 15:27

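By default, split drops trailing empty strings, so pass a negative limit to keep them. Because split with a String argument takes a regular expression, the pipe also has to be escaped. This is how we should split it:

val schema = StructType(
                schemaString
                   .split("\\|", -1)
                   .map( fieldName => StructField(fieldName,StringType,true) )
             );

val rowRDD = employee
                .map( _.split("\\|", -1) )   // -1 keeps the trailing empty field
                .map( e => Row(e(0),e(1),e(2),e(3)) );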
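
To illustrate why the negative limit matters, here is a small plain-Scala sketch using the second row of the sample file. The pattern is written as "\\|" because `String.split` with a String argument takes a regular expression, in which a bare | means alternation. The variable names are only for illustration.

```scala
val line = "1532|DAVE|50|"

// Default limit: trailing empty strings are removed, leaving 3 fields.
val dropped = line.split("\\|")

// Negative limit: trailing empty strings are kept, leaving 4 fields.
val kept = line.split("\\|", -1)

println(dropped.length)  // 3
println(kept.length)     // 4
println(kept(3).isEmpty) // true
```

Note that the fourth field comes back as an empty string rather than null. If a real null is wanted in the DataFrame, one option is to map empty strings to null before building the Row.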

edited Dec 27 '16 at 15:27

ddb

answered Dec 27 '16 at 13:43

baburam1985

How does this solve the initial problem of the subscript being out of range? – swdev May 31 '17 at 17:19
