
I'm trying to write a test case for a program. For that, I'm reading a CSV file that has data in the following format.
account_number,struct_data
123456789,{"key1":"value","key2":"value2","keyn":"valuen"}
987678909,{"key1":"value0","key2":"value20","keyn":"valuen0"}
There are some hundreds of such rows.

I need to read the second column as a struct, but I'm getting the error "struct type expected, string type found".

I tried casting it to StructType, but then I get the error "StringType cannot be converted to StructType".

Should I change the format of my CSV? What else can I do?

ram_23
  • Do all of the struct values in the csv file contain the same schema? – Travis Hegner Oct 10 '19 at 12:18
  • I'll rephrase: Is the schema for the `struct_data` field the same for all of the records? For instance, do they each have the same and same number of keys: `key1`, `key2`, etc... – Travis Hegner Oct 10 '19 at 12:27
  • @TravisHegner Yes, the schema is the same for the struct_data field. – ram_23 Oct 10 '19 at 12:31
  • The edit that you made indicates that it is _not_ the same. If all of your JSON strings had the same keys, then you could use the Spark `from_json()` function to convert it to a struct type, but if the keys aren't the same, that won't work. – Travis Hegner Oct 10 '19 at 14:53
  • @Travis My bad, I changed it. The JSON part does have the same schema. Can you please elaborate on what you mean? – ram_23 Oct 10 '19 at 14:59

2 Answers


I've given my solution in Scala Spark; it might give some insight into your problem.

scala> val sdf = """{"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}"""
sdf: String = {"df":[{"actNum": "1234123", "strType": [{"key1": "value1", "key2": "value2"}]}]}

scala> val erdf = spark.read.json(Seq(sdf).toDS).toDF().withColumn("arr", explode($"df")).select("arr.*")
erdf: org.apache.spark.sql.DataFrame = [actNum: string, strType: array<struct<key1:string,key2:string>>]

scala> erdf.show()
+-------+-----------------+
| actNum|          strType|
+-------+-----------------+
|1234123|[[value1,value2]]|
+-------+-----------------+


scala> erdf.printSchema
root
 |-- actNum: string (nullable = true)
 |-- strType: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key1: string (nullable = true)
 |    |    |-- key2: string (nullable = true)
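
If you need the struct's keys as top-level columns, a minimal follow-up sketch (reusing the `erdf` from the transcript above) could explode the array and select the fields directly:

scala> erdf.select($"actNum", explode($"strType").as("kv")).select($"actNum", $"kv.key1", $"kv.key2").show()
+-------+------+------+
| actNum|  key1|  key2|
+-------+------+------+
|1234123|value1|value2|
+-------+------+------+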

Sarath Chandra Vema

If all of the JSON records have the same schema, you can define it and use Spark's from_json() function to accomplish your task.

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
import spark.implicits._

val df = Seq(
  (123456789, "{\"key1\":\"value\",\"key2\":\"value2\",\"keyn\":\"valuen\"}"),
  (987678909, "{\"key1\":\"value0\",\"key2\":\"value20\",\"keyn\":\"valuen0\"}")
).toDF("account_number", "struct_data")

// All JSON records share the same three string keys.
val schema = new StructType()
  .add($"key1".string)
  .add($"key2".string)
  .add($"keyn".string)

// Parse the JSON string column into a struct column.
val df2 = df.withColumn("st", from_json($"struct_data", schema))

df2.printSchema
df2.show(false)

This snippet produces the following output:

root
 |-- account_number: integer (nullable = false)
 |-- struct_data: string (nullable = true)
 |-- st: struct (nullable = true)
 |    |-- key1: string (nullable = true)
 |    |-- key2: string (nullable = true)
 |    |-- keyn: string (nullable = true)

+--------------+---------------------------------------------------+------------------------+
|account_number|struct_data                                        |st                      |
+--------------+---------------------------------------------------+------------------------+
|123456789     |{"key1":"value","key2":"value2","keyn":"valuen"}   |[value,value2,valuen]   |
|987678909     |{"key1":"value0","key2":"value20","keyn":"valuen0"}|[value0,value20,valuen0]|
+--------------+---------------------------------------------------+------------------------+
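
If you then want each key as its own top-level column, a short follow-on sketch (using the `df2` from above) can expand the struct's fields with a star expression:

// Promote the parsed struct's fields to top-level columns (key1, key2, keyn).
val flat = df2.select($"account_number", $"st.*")
flat.show(false)
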
Travis Hegner
  • I changed it for Java and got the desired output. But when I run it, I get a totally different error: `java.lang.IllegalArgumentException: Illegal pattern component: XXX`. After exploring the internet, I suspect it is because I have some date data in "YYYY-MM-DD" format. I'm stuck. How can I give date formats in the schema? If possible, please give a solution for Java Spark. TIA. – ram_23 Oct 14 '19 at 13:18
  • I believe it'd be best to pull your dates into the dataframe as strings, then convert them using the Spark `to_timestamp()` function, as outlined here: https://stackoverflow.com/a/37449188/2639647 – Travis Hegner Oct 14 '19 at 18:33
  • Please help me with this problem. In your answer you wrote `val df = Seq( (123456789, "{\"key1\":\"value\",\"key2\":\"value2\"....`, but I can't build the dataframe like this, because the second column consists of many such key-value pairs. I would like to read it from the CSV itself. Is that possible? – ram_23 Oct 15 '19 at 05:44
  • Certainly. Just use `Dataset<Row> df = spark.read().csv("/path/to/file")` instead. Then your data frame should have a string column for every field in the CSV. – Travis Hegner Oct 15 '19 at 13:08
  • But if I read it like that, the last column gets broken up because of the commas inside it. – ram_23 Oct 15 '19 at 13:16
  • I tried breaking up your code: `$"key1".string` is of type StructField. I don't know how to replicate the same in Java. I tried `new StructType().add("colname", DataTypes.StringType)`, which is the string-typed add. Any advice on this is appreciated. – ram_23 Oct 15 '19 at 13:18
  • Can the CSV be modified to have the json string quoted? If not, and there are only those two fields, you could `.map()` the dataframe to concat all fields except the first. – Travis Hegner Oct 15 '19 at 15:24
  • [This answer](https://stackoverflow.com/a/46568709/2639647) shows a non-shortcut way to create schemas, and may be more compatible or translatable to java. – Travis Hegner Oct 15 '19 at 15:26
  • I can modify the csv. What change should I make? – ram_23 Oct 15 '19 at 15:32
  • You will have to somehow wrap the json field in quotes. If you use double quotes, then you will have to escape the quotes that are part of the json structure. If you wrap it in single quotes, then you'll have to change to using `spark.read().option("quote", "'").csv("/path/to/file.csv")`. If your csv has a header, add `.option("header", "true")`. – Travis Hegner Oct 15 '19 at 18:08
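
Pulling the comment thread together, a minimal sketch in Scala (the file path, the `date_col` column name, and the single-quoted CSV layout are assumptions based on the suggestions above, not part of the original question) of reading the JSON column straight from the CSV:

import org.apache.spark.sql.functions.{from_json, to_timestamp}
import org.apache.spark.sql.types.StructType
import spark.implicits._

// Hypothetical file.csv, with the JSON field wrapped in single quotes:
// account_number,struct_data
// 123456789,'{"key1":"value","key2":"value2","keyn":"valuen"}'

val raw = spark.read
  .option("header", "true") // first line carries the column names
  .option("quote", "'")     // single quotes protect the embedded commas
  .csv("/path/to/file.csv")

val schema = new StructType()
  .add($"key1".string)
  .add($"key2".string)
  .add($"keyn".string)

val parsed = raw.withColumn("st", from_json($"struct_data", schema))
parsed.printSchema

// For dates, keep them as strings in the schema and convert afterwards,
// e.g. (hypothetical column name):
// parsed.withColumn("dt", to_timestamp($"date_col", "yyyy-MM-dd"))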