I have a DataFrame:
+------------------+-------------------+--------------------+
| name| sku| description|
+------------------+-------------------+--------------------+
| Mary Rodriguez| hand-couple-manage|Senior word socia...|
| Jose Henderson| together-table-oil|Apply girl treatm...|
| Karen Villegas| child-somebody|Every tell serve....|
| Olivia Lynch|forget-matter-avoid|Perhaps environme...|
| Whitney Wiley| side-blue-dream|Quickly short soc...|
| Brittany Johnson| east-pretty|Indicate view sim...|
| Paul Morris| radio-window-us|Society month sho...|
| Jason Patterson| night-art-be-act|Entire around pla...|
|      Kiara Gentry|   compare-politics|Air my kind staff...|
+------------------+-------------------+--------------------+
Desired output schema:
root
|-- sku: string (nullable = true)
|-- name_description: array (nullable = true)
| |-- element: string (containsNull = true)
How can I group by the sku column and combine the values from name and description into a name_description column, whose value for each sku is an array of JSON objects in the format [{"name":..., "description":...}, {"name":..., "description":...}, ...], in PySpark?
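This is a rough sketch of the direction I'm considering, using collect_list over to_json of a struct (assuming my DataFrame is named df and a SparkSession already exists), but I'm not sure it produces exactly the format above:

```python
from pyspark.sql import functions as F

# Group by sku; for each row, serialize (name, description) to a JSON string
# and collect those strings into an array per sku.
result = (
    df.groupBy("sku")
      .agg(
          F.collect_list(
              F.to_json(F.struct(F.col("name"), F.col("description")))
          ).alias("name_description")
      )
)

result.printSchema()
# root
#  |-- sku: string (nullable = true)
#  |-- name_description: array (nullable = true)
#  |    |-- element: string (containsNull = true)
```

Is this the right approach, or is there a better way to get the array of JSON objects per sku?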