I used rdd.map to extract and decode JSON from a column, like so:
    import base64
    import json

    def process_data(row):
        # "value" holds a JSON string whose "payload" field is base64-encoded JSON
        encoded_data = json.loads(row["value"])
        base64_bytes = encoded_data["payload"].encode('ascii')
        decoded_data_bytes = base64.b64decode(base64_bytes)
        data = json.loads(decoded_data_bytes.decode('ascii'), strict=False)
        return data, row["file_name"], row["load_time"]

    df = df.rdd.map(process_data).toDF()
The data column comes out as a map type, but I want it as a struct. Is that possible?
A row of the data I'm working with looks like this:
    {"value" = <encoded data>, "file_name"="a", "load_time"=1/1/1}
The encoded data (the contents of value) looks like this:
    {"payload"=[
        {
            "key_1"={
                "key_2"=val_2,
                "key_3"=val_3
            }
        }, {
            "key_1"={
                "key_2"=val_2,
                "key_3"=val_3
            }
        }
    ]}
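For anyone who wants to reproduce this, here is a minimal sketch in plain Python that builds one such row and runs the same decoding steps on it. The concrete values are made up, and it assumes that the base64-decoded payload is the list of objects shown above; the real data may differ.

```python
import base64
import json

# Made-up stand-ins for val_2 / val_3 (assumption: the real values differ).
decoded = [
    {"key_1": {"key_2": "val_2", "key_3": "val_3"}},
    {"key_1": {"key_2": "val_2", "key_3": "val_3"}},
]

# Base64-encode the JSON text and wrap it under "payload",
# which is the shape process_data expects to find in "value".
payload_b64 = base64.b64encode(json.dumps(decoded).encode("ascii")).decode("ascii")
row = {
    "value": json.dumps({"payload": payload_b64}),
    "file_name": "a",
    "load_time": "1/1/1",
}

# The same steps as process_data, applied to the sample row.
outer = json.loads(row["value"])
payload_bytes = base64.b64decode(outer["payload"].encode("ascii"))
data = json.loads(payload_bytes.decode("ascii"), strict=False)

print(type(data).__name__, type(data[0]).__name__)  # prints: list dict
```

The decoded result is made of plain Python lists and dicts, which is what process_data hands back to Spark for each row.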
To avoid this problem, I also tried using withColumn to decode and load the JSON, but when I loaded the JSON with this command:
    df.withColumn("payload", from_json(col("payload"), json_schema))
every cell in "payload" came back null (even when I limited myself to a single row).
Why doesn't this kind of load work? Is there a better way?