
I have a CSV file with the following structure:

Name  | Val1 | Val2 | Val3 | Val4 | Val5
John  | 1    | 2    |      |      |
Joe   | 1    | 2    |      |      |
David | 1    | 2    |      | 10   | 11

I am able to load this into an RDD fine, but when I try to create a schema and then a DataFrame from it, I get an IndexOutOfBoundsException.

Code is something like this ...

val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

When I try to perform an action on rowRDD, I get the error.

Any help is greatly appreciated.

– Enkay

4 Answers


This is not an answer to your question, but it may help solve your problem.

From the question I see that you are trying to create a DataFrame from a CSV file.

Creating a DataFrame from a CSV file can be done easily using the spark-csv package.

With spark-csv, the following Scala code can be used to read a CSV file:

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csvFilePath)

For your sample data I got the following result:

+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John|   1|   2|    |    |    |
|  Joe|   1|   2|    |    |    |
|David|   1|   2|    |  10|  11|
+-----+----+----+----+----+----+

You can also use inferSchema with the latest version. See this answer.
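For example, a minimal sketch of schema inference, assuming a spark-csv version that supports the inferSchema option:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // infer column types instead of defaulting everything to string
  .load(csvFilePath)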

– sag
  • Thank you! I will try this method. It seems to be much better than what I am trying. – Enkay Aug 31 '15 at 15:05
  • I have not been able to get it working. I downloaded and built the package, but I cannot get spark-shell --packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 to work. Where is the jar file expected to be? It is currently on the path in the spark/bin folder, but spark-shell cannot find the package. What am I doing wrong? – Enkay Sep 01 '15 at 04:35
  • I was using [Databricks](http://databricks.com/) to do that, but it should be possible via spark-shell as well. Let me check and update you. – sag Sep 01 '15 at 05:09
  • It seems there is an issue with spark-csv README. This command works ```bin/spark-shell --packages com.databricks:spark-csv_2.11:1.2.0```. And with this I am able to load your CSV using ```val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/home/sam/tmp/test.csv")``` – sag Sep 01 '15 at 05:19
  • Here is the content of the CSV that I am using ```Name,Val1,Val2,Val3,Val4,Val5 John,1,2,,, Joe,1,2,,, David,1,2,,10,11``` – sag Sep 01 '15 at 05:20
  • What was the location of the jar file? Did you just put it in the "path"? – Enkay Sep 01 '15 at 14:17
  • This worked. Thank you for all your help, Samuel! I appreciate it very much! – Enkay Sep 01 '15 at 14:40

Empty values are not the issue if the CSV file contains a fixed number of columns and your CSV looks like this (note the empty field delimited by its own commas):

David,1,2,10,,11

The problem is that your CSV file contains 6 columns, yet with:

val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

you try to read 7 columns. Just change your mapping to:

val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))

And Spark will take care of the rest.
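For completeness, here is a minimal sketch of the full pipeline under these assumptions: a hypothetical file test.csv with no header row, comma-delimited rows padded to six fields, and all columns read as nullable strings:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// split with limit -1 so trailing empty fields are kept
val fileRDD = sc.textFile("test.csv").map(_.split(",", -1))

val schema = StructType(
  Seq("Name", "Val1", "Val2", "Val3", "Val4", "Val5")
    .map(name => StructField(name, StringType, nullable = true)))

val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
val df = sqlContext.createDataFrame(rowRDD, schema)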

– TheMP
  • Hello, the p(6) was probably a typo when I keyed in the issue here. In my real program, the number of columns does match. I still have the issue. Thanks! – Enkay Aug 31 '15 at 15:04

A possible solution to the problem is to replace missing values with Double.NaN. Suppose I have a file example.csv with this row in it:

David,1,2,10,,11

You can read the CSV file as a text file as follows:

val fileRDD = sc.textFile("example.csv")
  .map(_.split(",", -1))  // -1 keeps trailing empty fields
  .map(y => y.head +: y.tail.map(k => if (k == "") Double.NaN else k.toDouble))  // keep the name as a string, convert the rest

Then you can use your code to create a DataFrame from it.
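A sketch of that last step, assuming Name stays a string and the value columns become doubles (the column names are taken from the question, and fileRDD is the RDD built above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType, DoubleType}

val schema = StructType(
  StructField("Name", StringType, nullable = true) +:
    Seq("Val1", "Val2", "Val3", "Val4", "Val5")
      .map(n => StructField(n, DoubleType, nullable = true)))

val df = sqlContext.createDataFrame(fileRDD.map(Row.fromSeq(_)), schema)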


You can do it as follows.

val rowRDD = sc
  .textFile(csvFilePath)
  .map(_.split(delimiter_of_file, -1))
  .map(p =>
    Row(
      p(0),
      p(1),
      p(2),
      p(3),
      p(4),
      p(5)))

Split using the delimiter of your file. When you set -1 as the limit, split keeps all the empty fields, including trailing ones.
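A quick illustration of the difference, using one of the rows from the question's sample data:

"John,1,2,,,".split(",")      // Array(John, 1, 2) -- trailing empty fields are dropped
"John,1,2,,,".split(",", -1)  // Array(John, 1, 2, "", "", "") -- all six fields are kept

This is why a row like John's, with nothing after Val2, can throw an IndexOutOfBoundsException even when the number of columns in the mapping is correct.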

– CRV