
I have a .csv file with rows containing string values:

"315700H9VGE9BHU9DK42","""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar""","cs","",""

My second field occasionally contains strings with embedded "quoted" values:

"""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar"""

When read into a Spark DataFrame, I want this value to present as:

"LEGOS s.r.o.", švédsky "LEGOS bolag med begr.amsvar"

I have tried this, and variations of the commented-out options, as described in these docs:

df = (spark
    .read
    .format("csv")
    .option("header", True)
    .option("delimiter", ",")
    .option("multiline", True)
    .option("escapeQuotes", True)
    
    .option("quote", "\"")  
    .option("escape", "\"")

    # .option("escape", "\\")
    # .option("escape", '""') 
    # .option("escape", "\n")

    .schema(raw_schema)
    .csv(landing_schema_file)
     )

Any ideas?

I'm running on Apache Spark 3.3.0, Scala 2.12

BenP

2 Answers


If you have a .csv file which contains only this row:

"315700H9VGE9BHU9DK42","""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar""","cs","",""

And you want to return:

+--------------------+-----------------------------------------------------+---+----+----+
|_c0                 |_c1                                                  |_c2|_c3 |_c4 |
+--------------------+-----------------------------------------------------+---+----+----+
|315700H9VGE9BHU9DK42|"LEGOS s.r.o.", švédsky "LEGOS bolag med begr.amsvar"|cs |null|null|
+--------------------+-----------------------------------------------------+---+----+----+

You can use:

spark.read.option("escape", "\"").csv(yourPath)
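As a minimal end-to-end sketch (the temporary path and the local Spark session are assumptions on my part, not from your question), writing your sample row to a file and reading it back with that option gives the expected value:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed path for the sketch; on Databricks you would use a dbfs:/ or file:/ path instead
sample = '"315700H9VGE9BHU9DK42","""LEGOS s.r.o."", švédsky ""LEGOS bolag med begr.amsvar""","cs","",""'
with open("/tmp/sample.csv", "w", encoding="utf-8") as f:
    f.write(sample + "\n")

df = spark.read.option("escape", "\"").csv("/tmp/sample.csv")
df.show(truncate=False)
# _c1 should show: "LEGOS s.r.o.", švédsky "LEGOS bolag med begr.amsvar"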

I hope this answers your question, good luck!

vilalabinot
  • What I end up with is: _c0 = 315700H9VGE9BHU9DK42; _c1 = """LEGOS s.r.o.""; _c2 = švédsky ""LEGOS bolag med begr.amsvar""". I used .option("escape", "\"") in my original example too; it fails to parse the quotes within quotes. – BenP May 12 '23 at 22:20

OK, I think I have narrowed the problem down to something in how Databricks and Spark process the data, rather than in how it is read.

I am reading a large 3 GB .csv file into Databricks using the read below, as per this answer and the answer by @vilalabinot:

df = (spark
    .read
    .format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .option("multiline", True)
    .option("escape", "\"")
    .csv(file_path_or_directory)
     )
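A quick sanity check on the read itself (just a sketch, not part of the original workflow) is to compare record counts with and without the multiline option, since that option changes how physical lines are combined into records:

# Sketch only: same path as above; a large difference in counts would point
# at quoted fields with embedded newlines being split into extra rows.
df_multi = (spark.read
    .option("header", True)
    .option("multiline", True)
    .option("escape", "\"")
    .csv(file_path_or_directory))

df_single = (spark.read
    .option("header", True)
    .option("escape", "\"")
    .csv(file_path_or_directory))

print(df_multi.count(), df_single.count())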

One of the fields in this dataset is the Country code. To test whether the file read was OK, group by and aggregate on it:

from pyspark.sql.functions import count

df.groupby("Country").agg(count("*").alias("count"))

This is where erroneous values were returned, i.e. non-country codes show up as groups:

+---------------------------------+-----+
|Country                          |count|
+---------------------------------+-----+
|PL-10                            |1    |
|1 TO 10, AFFIL TOWER, L. H. ROAD,|1    |
+---------------------------------+-----+

But if I then try to filter the df to these values:

from pyspark.sql.functions import col

display(df.filter(col('Country') == "PL-10"))

Nothing is returned. So I think this is in fact something to do with how Spark or Databricks optimises different operations, rather than with the data read step itself. A few other tests I tried that replicate this behaviour on groupby and filter were (see the sketch after this list for one further check):

  • reading from json version of the data
  • splitting the .csv into smaller files e.g. here
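
To separate a genuine parse problem from a grouping quirk, one further check is to look directly for rows whose Country value is not a two-character code (a sketch; the two-letter assumption about the Country field is mine):

from pyspark.sql.functions import col, length, trim

# Sketch: assumes Country should always be a two-letter code
bad_rows = df.filter(length(trim(col("Country"))) != 2)
print(bad_rows.count())
bad_rows.select("Country").show(truncate=False)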

For reference, the dataset is the GLEIF LEI index.

BenP