I am trying to parse the following CSV file with the settings shown below.
ArrayType
"[""a"",""ab"",""avc""]"
"[1,23,33]"
"[""1"",""22""]"
"[""1"",""22"",""12222222.32342342314"",123412423523.3414]"
"[a,c,s,a,d,a,q,s]"
"["""","""","""",""""]"
"["","","",""]"
"[""abcgdjasc"",""jachdac"",""''""]"
"[""a"",""ab"",""avc""]"
val df = spark.read.format("csv").option("header","true").option("escape","\"").option("quote","\"").load("/home/ArrayType.csv")
Output:
scala> df.show()
+--------------------+
| ArrayType|
+--------------------+
| ["a","ab","avc"]|
| [1,23,33]|
| ["1","22"]|
|["1","22","122222...|
| [a,c,s,a,d,a,q,s]|
| ["","","",""]|
| [",",","]|
|["abcgdjasc","jac...|
| ["a","ab","avc"]|
+--------------------+
Since the escape character here is `"` (passed as "\"" in Scala), the embedded double quotes are unescaped and each row is read as a single column. However, if the input file instead looks like this:
ArrayType
"["a","ab","avc"]"
"[1,23,33]"
"["1","22"]"
"["1","22","12222222.32342342314",123412423523.3414]"
"[a,c,s,a,d,a,q,s]"
"["","","",""]"
"[",",","]"
"["abcgdjasc","jachdac","''"]"
"["a","ab","avc"]"
it shows me the following output, whereas I need it to be parsed the same way as before.
scala> df.show()
+-----------------+-------+--------------------+-------------------+
| _c0| _c1| _c2| _c3|
+-----------------+-------+--------------------+-------------------+
| "["a"| ab| "avc"]"| |
| [1,23,33]| | | |
| "["1"| "22"]"| | |
| "["1"| 22|12222222.32342342314|123412423523.3414]"|
|[a,c,s,a,d,a,q,s]| | | |
| [",",","]| | | |
| [| ,| ]| |
| "["abcgdjasc"|jachdac| "''"]"| |
| "["a"| ab| "avc"]"| |
+-----------------+-------+--------------------+-------------------+
So even when the embedded quotes are not escaped, I still want the same output as before, with each row read as one column instead of being split on the commas.
How can I read the second CSV file as a single column in a DataFrame?
How can I support both kinds of files being parsed into a single column?
I am using the univocity CSV parser.
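One workaround I am considering, as a minimal sketch: read the file as plain text so the CSV quote/escape rules never apply, then strip the wrapping quotes myself. This assumes every row fits in a single column, as in my sample file; the `raw` and `single` names are just for illustration.

    import org.apache.spark.sql.functions.regexp_replace

    // Read each line as raw text so no CSV quoting/escaping is applied.
    val raw = spark.read.text("/home/ArrayType.csv")

    val single = raw
      .filter(raw("value") =!= "ArrayType")         // drop the header line
      .select(
        regexp_replace(raw("value"), "^\"|\"$", "") // strip the wrapping quotes
          .as("ArrayType"))

For the first file I would still have to collapse the doubled quotes (`""` to `"`), e.g. with a second regexp_replace, so I am not sure this handles both formats cleanly.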