
I have a dataframe with a column in which each row contains a list of dictionaries:

[
Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'},{...}]"),
Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'},{...}]")
]

How can I parse it into a dataframe structure like this:

key1  | key2 | key3 | keyN |
value1|value2|value3|valueN|
value1|value2|value3|valueN|
  • You may have a look at [from_json](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.from_json) for your purpose. – pansen Jan 31 '18 at 10:44
  • Possible duplicate of [How to cast string to ArrayType of dictionary (JSON) in PySpark](https://stackoverflow.com/questions/51713790/how-to-cast-string-to-arraytype-of-dictionary-json-in-pyspark) – devinbost Sep 07 '18 at 00:45
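
For reference, a minimal sketch of the from_json route suggested in the comments, assuming Spark 2.2+ (for array-of-map schemas), a dataframe spark_df with a string column payload (hypothetical names), and that each list position always holds the same key; the regexp_replace first normalizes the single quotes to valid JSON:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, MapType, StringType

# parse each payload string into an array of string->string maps
parsed = spark_df.withColumn(
    "parsed",
    F.from_json(F.regexp_replace("payload", "'", '"'),
                ArrayType(MapType(StringType(), StringType()))))

# pull each key out of its positional map into its own column
result = parsed.select(
    parsed["parsed"][0]["key1"].alias("key1"),
    parsed["parsed"][1]["key2"].alias("key2"),
    parsed["parsed"][2]["key3"].alias("key3"))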

2 Answers


You can proceed as follows:

import json
from pyspark.sql import Row

l = [Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]"),
     Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]")]

# convert the list of Rows to an RDD:
ll = sc.parallelize(l)

# merge each row's list of single-key dicts into one dict, serialize it
# back to a JSON string, and let Spark infer the schema
# (note: eval is unsafe on untrusted input; ast.literal_eval is safer):
df = sqlContext.read.json(ll.map(lambda r: json.dumps(
    dict(kv for d in eval(r.payload) for kv in d.items()))))

Explanation:

The only non-obvious part is the following intermediate expression:

dict(kv for d in eval(r.payload) for kv in d.items())

It converts each payload from this format:

[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]

to this one:

{'key3': 'value3', 'key2': 'value2', 'key1': 'value1'}
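
The same transformation can be checked in a plain Python shell (dict key order may vary by Python version):

>>> payload = u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]"
>>> dict(kv for d in eval(payload) for kv in d.items())
{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}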

Output:

>>> df
DataFrame[key1: string, key2: string, key3: string]
>>> df.show() 
+------+------+------+
|  key1|  key2|  key3|
+------+------+------+
|value1|value2|value3|
|value1|value2|value3|
+------+------+------+
  • Actually this result gives some columns that I'd to add to the original dataframe. What would be the best way ? – TCreuillenet Jan 31 '18 at 14:55
  • 1
    I'm amazed that both the question from the OP and the accepted answer were both given negative points from downvotes... – devinbost Sep 07 '18 at 00:43
  • @devinbost: Stackoverflow should be more restricted with downvoters. There should be some kind of validation before downvote is accepted. Now it's only based on a rank of a user. – Markus Sep 08 '18 at 19:44
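
Regarding the follow-up comment about attaching the parsed keys to the original dataframe: rather than building a separate dataframe and joining it back, the parsing can be done in place with a UDF. A minimal sketch, assuming Spark 2.3+ (for the udf decorator), an original dataframe original_df with a string column payload (hypothetical names), and known key names:

import ast
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

# merge each payload's list of single-key dicts into one map;
# ast.literal_eval is a safer alternative to eval here
@F.udf(MapType(StringType(), StringType()))
def parse_payload(payload):
    return {k: v for d in ast.literal_eval(payload) for k, v in d.items()}

with_kv = original_df.withColumn("kv", parse_payload("payload"))
result = with_kv.select(
    "*", *[with_kv["kv"][k].alias(k) for k in ("key1", "key2", "key3")]
).drop("kv")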

To get the expected dataframe structure (here as a pandas dataframe):

import pandas as pd
from pyspark.sql import Row

rows = [
    Row(payload=u"[{'key1':'value1'},{'key2':'value2'},{'key3':'value3'}]"),
    Row(payload=u"[{'key1':'value4'},{'key2':'value5'},{'key3':'value6'}]")]

# parse each payload string back into a list of dicts
new_data = [eval(row['payload']) for row in rows]
# [[{'key1': 'value1'}, {'key2': 'value2'}, {'key3': 'value3'}], [{'key1': 'value4'}, {'key2': 'value5'}, {'key3': 'value6'}]]

# merge each row's single-key dicts into one dict per row
data_list = []
for sub_list in new_data:
    merged = {}
    for dict_val in sub_list:
        merged.update(dict_val)
    data_list.append(merged)
# [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}, {'key1': 'value4', 'key2': 'value5', 'key3': 'value6'}]

df = pd.DataFrame(data_list)

#     key1    key2    key3
# 0  value1  value2  value3
# 1  value4  value5  value6
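
Note that this yields a pandas dataframe rather than a Spark one; if a Spark dataframe is needed, the result can be converted back, e.g.:

spark_df = sqlContext.createDataFrame(df)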