
I have a CSV file with the following structure:

Name  | Val1 | Val2 | Val3 | Val4 | Val5
John  | 1    | 2    |      |      |
Joe   | 1    | 2    |      |      |
David | 1    | 2    |      | 10   | 11

I am able to load this into an RDD fine, but when I try to create a schema and then a DataFrame from it, I get an IndexOutOfBoundsException.

Code is something like this ...

val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

When I try to perform an action on rowRDD, I get the error.

Any help is greatly appreciated.

– Enkay

4 Answers


This is not an answer to your question, but it may help solve your problem.

From the question I see that you are trying to create a DataFrame from a CSV file.

Creating a DataFrame from a CSV file can be done easily using the spark-csv package.

With spark-csv, the following Scala code can be used to read a CSV file:

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csvFilePath)

For your sample data I got the following result:

+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John|   1|   2|    |    |    |
|  Joe|   1|   2|    |    |    |
|David|   1|   2|    |  10|  11|
+-----+----+----+----+----+----+

You can also use inferSchema with the latest version. See this answer.
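For example, a minimal sketch of schema inference, assuming a spark-csv version that supports the inferSchema option:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // infer column types instead of defaulting everything to string
  .load(csvFilePath)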

– sag
  • Thank you! I will try this method. It seems to be much better than what I am trying. – Enkay Aug 31 '15 at 15:05
  • I have not been able to get it working. I downloaded and built the package, but I cannot get spark-shell --packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 to work. Where is the jar file expected to be? It is currently on the path in the spark/bin folder, but spark-shell cannot find the package. What am I doing wrong? – Enkay Sep 01 '15 at 04:35
  • I was using [Databricks](http://databricks.com/) to do that, but it should be possible via spark-shell as well. Let me check and update you. – sag Sep 01 '15 at 05:09
  • It seems there is an issue with spark-csv README. This command works ```bin/spark-shell --packages com.databricks:spark-csv_2.11:1.2.0```. And with this I am able to load your CSV using ```val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/home/sam/tmp/test.csv")``` – sag Sep 01 '15 at 05:19
  • Here is the content of the CSV that I am using ```Name,Val1,Val2,Val3,Val4,Val5 John,1,2,,, Joe,1,2,,, David,1,2,,10,11``` – sag Sep 01 '15 at 05:20
  • What was the location of the jar file? Did you just put it in the "path"? – Enkay Sep 01 '15 at 14:17
  • This worked. Thank you for all your help, Samuel! I appreciate it very much! – Enkay Sep 01 '15 at 14:40

Empty values are not the issue if the CSV file contains a fixed number of columns and your CSV looks like this (note the empty field delimited by its own commas):

David,1,2,10,,11

The problem is that your CSV file contains 6 columns, yet with:

val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

you try to read 7 columns. Just change your mapping to:

val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))

And Spark will take care of the rest.
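For completeness, here is a minimal sketch of the full pipeline under these assumptions: a hypothetical file test.csv with no header row, comma-delimited rows padded to six fields, and all columns read as nullable strings:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// split with limit -1 so trailing empty fields are kept
val fileRDD = sc.textFile("test.csv").map(_.split(",", -1))

val schema = StructType(
  Seq("Name", "Val1", "Val2", "Val3", "Val4", "Val5")
    .map(name => StructField(name, StringType, nullable = true)))

val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
val df = sqlContext.createDataFrame(rowRDD, schema)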

– TheMP
  • Hello, the p(6) was probably a typo when I keyed in the issue here. In my real program, the number of columns does match. I still have the issue. Thanks! – Enkay Aug 31 '15 at 15:04

A possible solution to the problem is to replace missing values with Double.NaN. Suppose I have a file example.csv with this row in it:

David,1,2,10,,11

You can read the CSV file as a text file as follows:

val fileRDD = sc.textFile("example.csv")
  .map(_.split(",", -1))  // -1 keeps trailing empty fields
  .map(y => y.head +: y.tail.map(k => if (k == "") Double.NaN else k.toDouble))  // keep the name as a string, convert the rest

Then you can use your code to create a DataFrame from it.
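A sketch of that last step, assuming Name stays a string and the value columns become doubles (the column names are taken from the question, and fileRDD is the RDD built above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType, DoubleType}

val schema = StructType(
  StructField("Name", StringType, nullable = true) +:
    Seq("Val1", "Val2", "Val3", "Val4", "Val5")
      .map(n => StructField(n, DoubleType, nullable = true)))

val df = sqlContext.createDataFrame(fileRDD.map(Row.fromSeq(_)), schema)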


You can do it as follows.

val rowRDD = sc
  .textFile(csvFilePath)
  .map(_.split(delimiter_of_file, -1))
  .map(p =>
    Row(
      p(0),
      p(1),
      p(2),
      p(3),
      p(4),
      p(5)))

Split using the delimiter of your file. When you set -1 as the limit, split keeps all the empty fields, including trailing ones.
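A quick illustration of the difference, using one of the rows from the question's sample data:

"John,1,2,,,".split(",")      // Array(John, 1, 2) -- trailing empty fields are dropped
"John,1,2,,,".split(",", -1)  // Array(John, 1, 2, "", "", "") -- all six fields are kept

This is why a row like John's, with nothing after Val2, can throw an IndexOutOfBoundsException even when the number of columns in the mapping is correct.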

– CRV