
I have a CSV file that is really nasty to parse. One column contains quoted values with embedded commas, and it is followed by columns that contain JSON. Example:

+---------+-------------------+--------------------------------+--------------------------------------------------+
| column1 | column2           | jsonColumn1                    | jsonColumn2                                      |
+---------+-------------------+--------------------------------+--------------------------------------------------+
| 201     | "1", "ABC", "92"  | [{"Key1": 200, "Value1": 21},  | [{"date":"9999-09-26T08:50:06Z",                 |
|         |                   |  {"Key2": 200, "Value2": 4}]   |  "fakenumber":"1-877-488-2364-",                 |
|         |                   |                                |  "fakedata":"4.20","fakedata2":"102332.06"}]     |
+---------+-------------------+--------------------------------+--------------------------------------------------+

I need to extract it using Scala. How do I make the reader ignore the commas inside column2, and then append selected key/value pairs from the JSON columns as new columns on each row? I want it to look like this:

+---------+-------------------+----------------------+----------------------+----------------+----------------------+
| column1 | column2           | jsonColumn1          | jsonColumn2          | jsonColumn1Key | jsonColumnDate       |
+---------+-------------------+----------------------+----------------------+----------------+----------------------+
| 201     | "1", "ABC", "92"  | keep original record | keep original record | 200            | 9999-09-26T08:50:06Z |
+---------+-------------------+----------------------+----------------------+----------------+----------------------+
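For the commas inside column2 to be ignored, the file has to follow the usual CSV quoting convention: the whole field is wrapped in double quotes and any literal quote inside it is doubled. As a minimal pure-Scala sketch (an illustration, not Spark's actual implementation), this is the rule Spark's CSV reader applies when both `quote` and `escape` are set to `"`:

```scala
// Minimal RFC 4180-style field splitter for a single line:
// - commas inside a double-quoted field do not split fields
// - a doubled quote ("") inside a quoted field is a literal quote
// This mirrors what .option("quote", "\"").option("escape", "\"") does.
def splitCsvLine(line: String): Vector[String] = {
  val fields   = Vector.newBuilder[String]
  val cur      = new StringBuilder
  var inQuotes = false
  var i        = 0
  while (i < line.length) {
    val c = line(i)
    if (inQuotes) {
      if (c == '"') {
        // "" -> literal quote; a lone " closes the quoted section
        if (i + 1 < line.length && line(i + 1) == '"') { cur += '"'; i += 1 }
        else inQuotes = false
      } else cur += c
    } else {
      if (c == '"') inQuotes = true
      else if (c == ',') { fields += cur.result(); cur.clear() }
      else cur += c
    }
    i += 1
  }
  fields += cur.result()
  fields.result()
}

// The raw line 201,"""1"", ""ABC"", ""92""",rest yields three fields,
// the middle one being the literal text: "1", "ABC", "92"
val fields = splitCsvLine("201,\"\"\"1\"\", \"\"ABC\"\", \"\"92\"\"\",rest")
```

If the file is not quoted this way (i.e., the inner quotes are not doubled), no reader option will recover the field boundaries, and the file itself needs fixing.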

What I've done so far is import the data and define the schema before parsing; the plan is then to use `StructField` to add an inner schema for the columns that contain JSON.

import org.apache.spark.sql.types._

val csvSchema = new StructType()
  .add("column1", StringType, true)
  .add("column2", StringType, true)
  .add("jsonColumn1", StringType, true)
  .add("jsonColumn2", StringType, true)

The first issue I run into is column2. How do I work around it? For parsing the JSON inside the CSV I was going to emulate the solution here: split JSON value from CSV file and create new column based on json key in Spark/Scala

EDIT

val csvfile = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .load("file.csv")

display(csvfile)
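With the quoting handled by the `quote`/`escape` options, one way to append the key/value pairs is `get_json_object` with a JSON path, which leaves the original columns intact. A sketch under assumptions (the in-memory sample DataFrame and the path expressions below are rebuilt from the example row, not from the real file):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("jsonCols").getOrCreate()
import spark.implicits._

// Sample row rebuilt from the question's example data (assumption:
// in the real job this DataFrame comes from the csv read above).
val csvfile = Seq(
  ("201", "\"1\", \"ABC\", \"92\"",
    """[{"Key1":200,"Value1":21},{"Key2":200,"Value2":4}]""",
    """[{"date":"9999-09-26T08:50:06Z","fakenumber":"1-877-488-2364-","fakedata":"4.20","fakedata2":"102332.06"}]""")
).toDF("column1", "column2", "jsonColumn1", "jsonColumn2")

// $[0].Key1 reads "Key1" from the first object of the JSON array;
// the original jsonColumn1/jsonColumn2 records are kept as-is.
val withExtras = csvfile
  .withColumn("jsonColumn1Key", get_json_object($"jsonColumn1", "$[0].Key1"))
  .withColumn("jsonColumnDate", get_json_object($"jsonColumn2", "$[0].date"))
```

Note that `get_json_object` always returns strings; if you need typed values or several fields at once, `from_json` with an explicit schema (as in the linked question) is the alternative.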
  • What does a typical line from the csv file look like? – Shawn Oct 11 '18 at 21:18
  • And are you using a csv library to read it, or trying to write your own parser? – Shawn Oct 11 '18 at 21:19
  • The first box I showed is what the typical line looks like. This is a simplified version of it. – ssjtam Oct 11 '18 at 21:37
  • And I am using the CSV library reader. I'm open to Scala or Pyspark – ssjtam Oct 11 '18 at 21:38
  • I mean the actual lines of data. – Shawn Oct 11 '18 at 21:54
  • You should also show a minimal working example program that demonstrates how you read the file and what you're trying to do with it. That you're having a problem with commas at all suggests you're not doing something the appropriate way, or there's something wrong with your csv file. So we need to see what you're doing to rule out problems. – Shawn Oct 11 '18 at 21:56
  • Put it in your question, properly formatted so it's readable. – Shawn Oct 11 '18 at 22:01
  • I fixed the double quotes issue. Now I need to find the best way to append the key-value as a new column – ssjtam Oct 11 '18 at 22:14
  • I figured it out :) Does anyone know how to convert this dataframe to a single CSV file? – ssjtam Oct 11 '18 at 22:54
  • Please post your solution as an answer to this question for others to see. @ssjtam – Jordan Parmer Oct 12 '18 at 18:15

0 Answers