
I have a CSV file with the data below:

id,name,comp_name
1,raj,"rajeswari,motors"
2,shiva,amber kings

My first requirement is to read this file into a Spark RDD and then split each line on the comma delimiter with map. However, the code below splits on every comma, including the one inside the quotes:

val splitdata = data.map(_.split(","))

I do not want to split on commas within double quotes, but I do NOT want to use a regex. Is there a simple, efficient way to achieve this?

My second requirement is to read the same CSV file into a Spark DataFrame and show it, but I need the double quotes to appear in the result. The output should look like this:

id name comp_name
1 raj "rajeswari,motors"
2 shiva amber kings

Double quotes are not normally shown, but is there any way to do it?

I am using Spark 2.4 / Scala 2.11 / Eclipse IDE.

Roy John

1 Answer


I would suggest using a DataFrame instead of an RDD:

val df = spark.read.option("header", "true").csv("csv/file/path")
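The CSV reader strips the enclosing quotes while parsing, so the DataFrame itself no longer knows which values were quoted in the file. One possible workaround for the second requirement is to add the quotes back before calling show. The following is only a sketch under assumptions: `ShowQuotedColumn` and `requote` are hypothetical names, the sample rows are inlined instead of read from a path, and "contains a comma" is used as a stand-in for "was quoted in the file", which will not hold for every input.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit, when}

object ShowQuotedColumn {
  // Hypothetical helper: wraps a value in double quotes when it contains a comma.
  // Assumption: a comma in the value means the field was quoted in the file.
  def requote(df: DataFrame, column: String): DataFrame =
    df.withColumn(
      column,
      when(col(column).contains(","), concat(lit("\""), col(column), lit("\"")))
        .otherwise(col(column)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ShowQuotedColumn")
      .getOrCreate()
    import spark.implicits._

    // Inline copy of the sample data so the sketch does not depend on a file path.
    val lines = Seq(
      "id,name,comp_name",
      "1,raj,\"rajeswari,motors\"",
      "2,shiva,amber kings").toDS()

    // csv(Dataset[String]) parses the lines the same way as a file read.
    val df = spark.read.option("header", "true").csv(lines)

    // truncate = false keeps long values from being shortened in the output.
    requote(df, "comp_name").show(false)

    spark.stop()
  }
}
```

With the sample rows, the comp_name column should then show "rajeswari,motors" with its quotes and amber kings without them. A field that was quoted in the file but contains no comma would not get its quotes back with this heuristic.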

If you stay with the RDD, String.split has no built-in notion of quoting, so with split you need a regex like the one below to ignore any "," enclosed between double quotes:

val raw = sc.textFile("file:///tmp/stackoverflow_q_72457003.csv")
raw.map(_.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)")(2)).foreach(println)

You'd get output like this:

"rajeswari,motors"

amber kings

Refer to this post to understand the expression: Splitting on comma outside quotes
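For readers who want to avoid the regex, the same split can also be written as a single pass over the characters that tracks whether the scan is currently inside double quotes. This is a minimal sketch, not a full CSV parser: `QuoteAwareSplit` and `splitCsvLine` are hypothetical names, quote characters are kept in the output (as with the regex above), and quoted fields that span several lines are not handled.

```scala
import scala.collection.mutable.ArrayBuffer

object QuoteAwareSplit {
  // Hypothetical helper (not part of Spark): splits one CSV line on commas
  // that are outside double quotes. Quote characters are kept in the fields.
  def splitCsvLine(line: String): Array[String] = {
    val fields = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var inQuotes = false

    for (ch <- line) {
      if (ch == '"') {
        // Every quote toggles the state; the quote itself stays in the field.
        inQuotes = !inQuotes
        current.append(ch)
      } else if (ch == ',' && !inQuotes) {
        // A comma outside quotes ends the current field.
        fields += current.toString
        current.clear()
      } else {
        current.append(ch)
      }
    }
    // The last field has no trailing comma, so add it explicitly.
    fields += current.toString
    fields.toArray
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("1,raj,\"rajeswari,motors\"", "2,shiva,amber kings")
    lines.map(splitCsvLine).foreach(f => println(f.mkString(" | ")))
    // prints:
    // 1 | raj | "rajeswari,motors"
    // 2 | shiva | amber kings
  }
}
```

On an RDD of lines this could be used as raw.map(QuoteAwareSplit.splitCsvLine). One difference from the regex version: String.split drops trailing empty fields, while this helper keeps them.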


Abhishek
  • That was my second requirement, but I need to show the double quotes in the data. A DataFrame does NOT show double quotes. The output I receive has no double quotes if I read with a DataFrame: 1 raj rajeswari,motors 2 shiva amber kings – Roy John Jun 01 '22 at 06:33
  • I think using a regular expression would be the best approach, because while splitting we need to check whether a comma is enclosed by quotes. With split, only a regex has that capability. – Abhishek Jun 01 '22 at 12:32
  • OK, thanks. I thought there might be another solution or a direct method. I am new to this tech, so I am not very familiar with it. – Roy John Jun 01 '22 at 13:11