Spark reading dataset wrong and weird

Question

I am facing a weird issue while reading a file from S3. This is what I am doing:

val previousDay = spark.read
      .option("header", "false")
      .schema(schema)
      .csv(loadPath)
      .cache()

This is the schema:

import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

// All five columns are read as nullable strings.
val schema = StructType(
    Array(
      StructField("location_id", DataTypes.StringType, nullable = true),
      StructField("uuid", DataTypes.StringType, nullable = true),
      StructField("country_code", DataTypes.StringType, nullable = true),
      StructField("shard", DataTypes.StringType, nullable = true),
      StructField("has_activity", DataTypes.StringType, nullable = true)
    )
  )

This is how the CSV looks:

"location_id","uuid","country_code","shard","has_activity"
"35fb2f0XX","06d0XX","FRA","eu","t"
"9ee98XX","7cd3c7XX","DEU","eu",""
"9d193XX","128abXX","ITA","eu",""

However, when I call show on previousDay, this is what I get:

+-----------+--------+------------+--------+-----+
|lid.       |uid     |country     |activity|shard|
+-----------+--------+------------+--------+-----+
|location_id|uuid    |country_code|shard   |eu   |
|35fb2f0XX  |6d0XX   |FRA         |eu      |eu   |
|9ee98XX    |7cd3c7XX|DEU         |eu      |eu   |
|9d193XX.   |128abXX |ITA         |eu      |eu   |
+-----------+--------+------------+--------+-----+

As shown here, the shard values are replicated across two columns, the header row appears as a data row, and the activity values vanish completely.

I have no idea what is happening. I would appreciate any input on this.

Shouldn't there be a `StructType` enclosing the schema definition? — jrook, Oct 01 '20 at 03:25
I can't reproduce your output. Anyways, the output dataframe doesn't make sense. The column names should come from the schema. Why would it name the first column `lid`? — jrook, Oct 01 '20 at 04:49
Hi @Yogi, I wasn't able to reproduce your code in a `S3` context, but in a `HDFS` context changing `.option("header", "false")` to `.option("header", "true")` is working fine. I can't figure out what is happening. — Chema, Oct 01 '20 at 07:34
You’re missing information for anyone to answer this definitively; most likely your CSV is partitioned and one of the files has a broken header — Nick, Oct 01 '20 at 07:50
Or it is a parsing issue and you might need to escape double quotes. Please check a related post: https://stackoverflow.com/questions/40413526/reading-csv-files-with-quoted-fields-containing-embedded-commas — abiratsis, Oct 01 '20 at 14:47
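
The comments above point at two things worth checking: whether the header row is being read as data, and whether one of several input files has rows that do not match the schema. Below is a minimal sketch of those checks, not a confirmed fix. It assumes a local `SparkSession` and a temporary file holding the sample rows from the question as a stand-in for the S3 path. The `source_file` column name and the `checked` value are hypothetical names chosen for illustration. `columnNameOfCorruptRecord` and `mode` are existing CSV reader options.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, input_file_name}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local session for illustration only; the question's job reads from S3.
val spark = SparkSession.builder().master("local[1]").appName("csv-check").getOrCreate()

// Hypothetical stand-in for the S3 file: the sample rows from the question.
val loadPath = Files.createTempFile("previous_day", ".csv")
Files.write(loadPath,
  """"location_id","uuid","country_code","shard","has_activity"
    |"35fb2f0XX","06d0XX","FRA","eu","t"
    |"9ee98XX","7cd3c7XX","DEU","eu",""
    |"9d193XX","128abXX","ITA","eu",""
    |""".stripMargin.getBytes("UTF-8"))

// The question's schema plus one extra column that receives the raw text
// of any row Spark cannot parse against the five expected fields.
val schema = StructType(Seq(
  StructField("location_id", StringType, nullable = true),
  StructField("uuid", StringType, nullable = true),
  StructField("country_code", StringType, nullable = true),
  StructField("shard", StringType, nullable = true),
  StructField("has_activity", StringType, nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)
))

val checked = spark.read
  .option("header", "true")        // the file starts with a header row, so skip it
  .option("mode", "PERMISSIVE")    // keep malformed rows instead of dropping them
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv(loadPath.toString)
  .withColumn("source_file", input_file_name()) // which input file each row came from
  .cache()                         // cache before querying the corrupt-record column

// Rows that did not fit the schema, together with the file they came from.
checked.filter(col("_corrupt_record").isNotNull)
  .select("source_file", "_corrupt_record")
  .show(truncate = false)
```

If the real `loadPath` is a directory with several part files, the `source_file` column should show which file the odd rows come from. If fields can contain embedded quotes or commas, the `quote` and `escape` reader options discussed in the linked post are the next thing to try.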
