
I have a CSV file where one of the columns contains values enclosed in double quotes. This column also has commas in it. How do I read this kind of column from the CSV into an RDD in Spark using Scala? The values enclosed in double quotes should be read as an integer type, as they are values like total assets and total debts.

Example records from the CSV:

Jennifer,7/1/2000,0,,0,,151,11,8,"25,950,816","5,527,524",51,45,45,45,48,50,2,,
John,7/1/2003,0,,"200,000",0,151,25,8,"28,255,719","6,289,723",48,46,46,46,48,50,2,"4,766,127,272",169
  • I tried {val result = input.map(x => x.split(","))}. It is taking "25 as one column value and 950 as another column value from the first line. But couldn't get any more ideas. – ibh May 19 '17 at 17:50

2 Answers


I would suggest reading it with SQLContext as a CSV file, since that has well-tested mechanisms and flexible APIs to satisfy your needs.
You can do

val dataframe = sqlContext.read.csv("path to your csv file")

Output would be

+--------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
|     _c0|     _c1|_c2| _c3|    _c4| _c5|_c6|_c7|_c8|       _c9|     _c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|         _c18|_c19|
+--------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
|Jennifer|7/1/2000|  0|null|      0|null|151| 11|  8|25,950,816|5,527,524|  51|  45|  45|  45|  48|  50|   2|         null|null|
|    John|7/1/2003|  0|null|200,000|   0|151| 25|  8|28,255,719|6,289,723|  48|  46|  46|  46|  48|  50|   2|4,766,127,272| 169|
+--------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+

Now you can rename the headers, cast the required columns to integers, and do a lot of other things, for example as shown below.
You can even convert it to an RDD.
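A minimal sketch of the cast (the names total_assets and total_debts are only illustrative, and long is used because the sample value 4,766,127,272 does not fit in an Int):

    import org.apache.spark.sql.functions.{col, regexp_replace}

    // strip the thousands separators from the quoted columns, then cast to a numeric type
    val typed = dataframe
      .withColumnRenamed("_c9", "total_assets")
      .withColumnRenamed("_c10", "total_debts")
      .withColumn("total_assets", regexp_replace(col("total_assets"), ",", "").cast("long"))
      .withColumn("total_debts", regexp_replace(col("total_debts"), ",", "").cast("long"))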
Edited
If you prefer to read into an RDD and stay with RDDs, then read the file with sparkContext as a textFile:

 val rdd = sparkContext.textFile("/home/anahcolus/IdeaProjects/scalaTest/src/test/resources/test.csv")

Then split each line on commas, ignoring the commas that appear inside double quotes:

rdd.map(line => line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1))
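
As a follow-up, a minimal sketch of converting the quoted columns to numbers (the field indices 9 and 10 are assumptions taken from the sample rows, and Long is used instead of Int because 4,766,127,272 exceeds the Int range):

    val parsed = rdd
      .map(_.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1))
      .map { fields =>
        // drop the surrounding quotes and the thousands separators; empty fields become None
        def toLongOpt(s: String): Option[Long] = {
          val cleaned = s.replaceAll("[\",]", "").trim
          if (cleaned.isEmpty) None else Some(cleaned.toLong)
        }
        (fields(0), toLongOpt(fields(9)), toLongOpt(fields(10)))
      }

    parsed.collect().foreach(println)
    // e.g. (Jennifer,Some(25950816),Some(5527524))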
Ramesh Maharjan

@ibh This is not Spark- or Scala-specific stuff. In Spark you will read the file the usual way:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("app_name").setMaster("local")
    val ctx  = new SparkContext(conf)
    val file = ctx.textFile("<your file>.csv")
    file.foreach { line =>
      // split on the commas that are outside double quotes, as per the regex below
      val tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)
      // side effect: build your own object from the tokens
      val myObject = new MyObject(tokens)
      mylist.add(myObject)
    }

See this regex also.
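
As a quick sanity check of that regex (a standalone sketch built on the first sample row, no Spark needed), the quoted values stay together after the split:

    val line = "Jennifer,7/1/2000,0,,0,,151,11,8,\"25,950,816\",\"5,527,524\",51,45,45,45,48,50,2,,"
    val tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)
    println(tokens.length) // 20 fields; trailing empty fields are kept because of the -1 limit
    println(tokens(9))     // "25,950,816" stays a single token, with the quotes still attached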

Apurva Singh