I am running into a strange issue while reading a CSV file from S3 with Spark. This is how I read it:
val previousDay = spark.read
  .option("header", "false")
  .schema(schema)
  .csv(loadPath)
  .cache()
This is the schema:
StructType(
  Array(
    StructField("location_id", DataTypes.StringType, nullable = true),
    StructField("uuid", DataTypes.StringType, nullable = true),
    StructField("country_code", DataTypes.StringType, nullable = true),
    StructField("shard", DataTypes.StringType, nullable = true),
    StructField("has_activity", DataTypes.StringType, nullable = true)
  )
)
This is what the CSV looks like:
"location_id","uuid","country_code","shard","has_activity"
"35fb2f0XX","06d0XX","FRA","eu","t"
"9ee98XX","7cd3c7XX","DEU","eu",""
"9d193XX","128abXX","ITA","eu",""
However, when I call show on previousDay, this is what I get:
+-----------+--------+------------+--------+-----+
|lid.       |uid     |country     |activity|shard|
+-----------+--------+------------+--------+-----+
|location_id|uuid    |country_code|shard   |eu   |
|35fb2f0XX  |6d0XX   |FRA         |eu      |eu   |
|9ee98XX    |7cd3c7XX|DEU         |eu      |eu   |
|9d193XX.   |128abXX |ITA         |eu      |eu   |
+-----------+--------+------------+--------+-----+
As shown above, the shard values are replicated across two columns and the activity values vanish completely.
I have no idea what is happening. I would appreciate any input on this.
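For anyone who wants to reproduce this without S3, here is a minimal local sketch. The object name CsvSchemaCheck, the temporary file, and the local[1] master are assumptions made up for this illustration; the sample rows and the schema are the ones shown above. The only deliberate difference from my read is that header is set to "true", so the first line is skipped instead of being loaded as a data row. As far as I understand, Spark applies a user-supplied CSV schema to the columns by position rather than by name, so the field order in the schema has to match the column order in the file.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

// Hypothetical stand-alone check; the real job reads from S3 instead of a temp file.
object CsvSchemaCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-schema-check")
      .master("local[1]") // assumption: local mode, for the check only
      .getOrCreate()

    // Sample rows copied from the question, header line included.
    val sample =
      """"location_id","uuid","country_code","shard","has_activity"
        |"35fb2f0XX","06d0XX","FRA","eu","t"
        |"9ee98XX","7cd3c7XX","DEU","eu",""
        |"9d193XX","128abXX","ITA","eu",""
        |""".stripMargin

    val path = Files.createTempFile("previous_day", ".csv")
    Files.write(path, sample.getBytes("UTF-8"))

    // Same fields as in the question, in the same order as the file's columns.
    val schema = StructType(
      Array(
        StructField("location_id", DataTypes.StringType, nullable = true),
        StructField("uuid", DataTypes.StringType, nullable = true),
        StructField("country_code", DataTypes.StringType, nullable = true),
        StructField("shard", DataTypes.StringType, nullable = true),
        StructField("has_activity", DataTypes.StringType, nullable = true)
      )
    )

    val df = spark.read
      .option("header", "true") // skip the header line rather than reading it as data
      .schema(schema)
      .csv(path.toString)

    // Compare the column names and values against the raw file.
    df.printSchema()
    df.show(truncate = false)

    spark.stop()
  }
}
```

If this local run shows the columns correctly, the difference is more likely in the schema actually passed to the job or in the file on S3 than in the reader itself.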