
I have a CSV file as below:

name|age|county|state|country
"alex"john"|"30"|"burlington"|"nj"|"usa"

I use Spark to read the CSV file:

input_df = spark.read.format('csv').options(header='true', inferSchema='false', sep='|').load('s3://path_to_file')

display(input_df)

Output (I'm not sure why there are quotes around alex"john but not around the other fields):

name        age county     state    country
"alex"john" 30  burlington  nj      usa

Expected output:

name        age county     state    country
alex"john   30  burlington  nj      usa
Nats

2 Answers


Spark reads all of name as a single string (including all the quotes) because the quote in the middle throws the parser off. Just remove the first and last double quotes like this (after reading it):

from pyspark.sql import functions as F
# Drop the first and last characters (the surrounding quotes) from name.
df.withColumn("name", F.expr("substring(name, 2, length(name) - 2)")).show(truncate=False)

#+---------+---+----------+-----+-------+
#|name     |age|county    |state|country|
#+---------+---+----------+-----+-------+
#|alex"john|30 |burlington|nj   |usa    |
#+---------+---+----------+-----+-------+

In order to do it dynamically for all columns, I would suggest a regex like this:

from pyspark.sql import functions as F
# Strip a leading and/or trailing double quote from every column.
df.select(*[F.regexp_replace(x, '^"|"$', '').alias(x) for x in df.columns]).show(truncate=False)

#+---------+---+----------+-----+-------+
#|name     |age|county    |state|country|
#+---------+---+----------+-----+-------+
#|alex"john|30 |burlington|nj   |usa    |
#+---------+---+----------+-----+-------+
murtihash
  • In real time, I do not know the column names and this needs to be handled dynamically for all columns. – Nats May 21 '20 at 00:56
  • Let's say I have input as below. ``` name|age|county|state|country "alex\"john"|"30"|"burlington"|"nj"|"usa"``` .. How do I use the escape character so I avoid this regex ? – Nats May 21 '20 at 01:05
  • that should do it: https://stackoverflow.com/questions/40413526/reading-csv-files-with-quoted-fields-containing-embedded-commas (a sketch of that approach follows below) – murtihash May 21 '20 at 01:09
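
For reference, a minimal sketch of that linked approach, assuming the file actually escapes the inner quote with a backslash (as in the "alex\"john" sample from the comment above); quote and escape are standard Spark CSV reader options:

input_df = (spark.read.format('csv')
            .options(header='true', sep='|', quote='"', escape='\\')
            .load('s3://path_to_file'))
# The parser now strips the surrounding quotes and unescapes \" to ",
# so name comes back as alex"john with no regex post-processing.
input_df.show()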

This is a tricky one, given that there is nothing escaping that inner quote (such as a "\").

If you don't find a way to escape the inner quote, I suggest you read the data as is and trim the surrounding quotes using the regexp_replace function like so:

from pyspark.sql.functions import regexp_replace
df = (spark.read.option("delimiter", "|")
      .option("inferSchema", "true")
      .option("header", "true")
      .csv("tmp.csv"))
# Trim a leading and/or trailing double quote from the name column.
df.withColumn("formatted_name", regexp_replace(df.name, '^"|"$', "")).show()

Output:

+-----------+---+----------+-----+-------+--------------+
|       name|age|    county|state|country|formatted_name|
+-----------+---+----------+-----+-------+--------------+
|"alex"john"| 30|burlington|   nj|    usa|     alex"john|
+-----------+---+----------+-----+-------+--------------+
Ranvir Mohanlal
  • In real time, I do not know the column names and this needs to be handled dynamically for all columns (see the sketch below) – Nats May 21 '20 at 00:56
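
To handle that dynamically, the same trim can be looped over every column without knowing the names up front. A minimal sketch (the loop is my own illustration, mirroring the regex approach in the other answer):

from pyspark.sql.functions import regexp_replace
# Strip a leading and/or trailing double quote from every column,
# whatever the column names happen to be.
for c in df.columns:
    df = df.withColumn(c, regexp_replace(c, '^"|"$', ''))
df.show()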