I have a DataFrame in which each row contains an array of rows.
I want to aggregate all the inner rows into a single DataFrame.
Below is what I have achieved so far.
Running this:
df.select('*').take(1)
Gives me this:
[
Row(
body=[
Row(a=1, b=1),
Row(a=2, b=2)
]
)
]
So doing this:
df.rdd.flatMap(lambda x: x).collect()
I get this:
[[
Row(a=1, b=1),
Row(a=2, b=2)
]]
So I am forced to do this:
df.rdd.flatMap(lambda x: x).flatMap(lambda x: x)
So I can achieve the below:
[
Row(a=1, b=1),
Row(a=2, b=2)
]
Using the result above, I can finally convert it to a DataFrame and save it somewhere, which is what I want. But calling flatMap twice doesn't look right.
I tried to do the same thing using reduce, as in the following code:
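To make it clearer why two flatMap calls are needed, here is a rough plain-Python model of what seems to be happening. It does not use Spark at all: the namedtuples Outer and Inner are hypothetical stand-ins for pyspark.sql.Row (which is also a tuple subclass), and flat_map is a made-up helper that only imitates RDD.flatMap. The first pass iterates over the outer row and yields its single field (the list), and the second pass iterates over that list.

```python
from collections import namedtuple
from itertools import chain

# Hypothetical stand-ins for pyspark.sql.Row; like Row, they are tuple subclasses,
# so iterating over one yields its field values.
Outer = namedtuple("Outer", ["body"])
Inner = namedtuple("Inner", ["a", "b"])

rows = [Outer(body=[Inner(a=1, b=1), Inner(a=2, b=2)])]


def flat_map(func, items):
    """Plain-Python imitation of RDD.flatMap: apply func, then flatten one level."""
    return list(chain.from_iterable(func(item) for item in items))


# First pass: iterating an Outer yields its only field, which is the whole list.
once = flat_map(lambda x: x, rows)

# Second pass: iterating that list yields the inner rows themselves.
twice = flat_map(lambda x: x, once)

# Selecting the field by name reaches the inner rows in a single pass.
direct = flat_map(lambda row: row.body, rows)
```

If the model holds, the Spark counterpart of the last line would presumably be df.rdd.flatMap(lambda row: row.body), though I have only checked the plain-Python version here.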
from functools import reduce
from pyspark.sql import DataFrame
flatRdd = df.rdd.flatMap(lambda x: x)
dfMerged = reduce(DataFrame.unionByName, [flatRdd])
The second argument of reduce must be iterable, so I was forced to wrap the RDD in a list as [flatRdd]. Sadly, it gives me the same nested result:
[[
Row(a=1, b=1),
Row(a=2, b=2)
]]
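A likely explanation for the unchanged output, sketched in plain Python with no Spark involved: when functools.reduce receives a sequence with a single element and no initial value, it returns that element without ever calling the combining function. The combine function and the sample lists below are made up for illustration only.

```python
from functools import reduce

calls = []  # records every invocation of combine


def combine(left, right):
    """Hypothetical combiner that logs its arguments and concatenates two lists."""
    calls.append((left, right))
    return left + right


# One element: reduce has nothing to combine, so it returns the element as-is.
single = [[1, 2]]
single_result = reduce(combine, single)
calls_after_single = len(calls)

# Two elements: combine is called once to merge them.
pair_result = reduce(combine, [[1], [2]])
calls_after_pair = len(calls)
```

By the same logic, reduce(DataFrame.unionByName, [flatRdd]) would hand back flatRdd itself without calling unionByName, which would explain why nothing changed.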
There is certainly a better way to achieve what I want.