I have a CSV file that is really nasty to parse. One of its columns contains double quotes and commas inside the value, and it is followed by columns containing JSON. Example:
+---------+------------------+------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+
| column1 | column2          | jsonColumn1                                                | jsonColumn2                                                                                                  |
+---------+------------------+------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+
| 201     | "1", "ABC", "92" | [{"Key1": 200, "Value1": 21}, {"Key2": 200, "Value2": 4}] | [{"date":"9999-09-26T08:50:06Z","fakenumber":"1-877-488-2364-","fakedata":"4.20","fakedata2":"102332.06"}]  |
+---------+------------------+------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+
I need to extract it using Scala. How do I make it ignore the commas inside column2, and how do I append selected key-value pairs from the JSON as new columns on each row? I want it to look like this:
+---------+------------------+----------------------+----------------------+----------------+----------------------+
| column1 | column2          | jsonColumn1          | jsonColumn2          | jsonColumn1Key | jsonColumnDate       |
+---------+------------------+----------------------+----------------------+----------------+----------------------+
| 201     | "1", "ABC", "92" | keep original record | keep original record | 200            | 9999-09-26T08:50:06Z |
+---------+------------------+----------------------+----------------------+----------------+----------------------+
What I've done so far is import the data, create the schema (before parsing), and then use StructField to define a schema for the inner JSON in the columns that contain JSON.
import org.apache.spark.sql.types._

val csvSchema = new StructType()
  .add("column1", StringType, true)
  .add("column2", StringType, true)
  .add("jsonColumn1", StringType, true)
  .add("jsonColumn2", StringType, true)
The first issue I run into is column2. How do I work around its embedded commas and quotes? For parsing the JSON inside the CSV, I was going to emulate the solution here: split JSON value from CSV file and create new column based on json key in Spark/Scala
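To show what I mean by appending the key-value pairs, this is roughly what I was going to try once the file parses, based on that answer. Just a sketch: csvfile is the parsed DataFrame from the EDIT below, and the "$[0].Key1" / "$[0].date" paths are my guesses from the example row, since both JSON columns hold arrays of objects.

import org.apache.spark.sql.functions.{col, get_json_object}

// Sketch: pull one key from each JSON column into a new column,
// leaving the original JSON columns untouched
val withKeys = csvfile
  .withColumn("jsonColumn1Key", get_json_object(col("jsonColumn1"), "$[0].Key1"))
  .withColumn("jsonColumnDate", get_json_object(col("jsonColumn2"), "$[0].date"))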
EDIT
val csvfile = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("quote", "\"")   // treat " as the quote character
  .option("escape", "\"")  // "" inside a quoted field is an escaped quote
  .load("file.csv")

display(csvfile)
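To check whether the quote/escape options actually keep column2 together, I was going to inspect the result like this (quick sketch):

// Sketch: verify column2 came through as a single field
csvfile.printSchema()
csvfile.show(truncate = false)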