explode each column of dataframe without data duplication

Question

I am using pyspark or pandas

I have this dataframe and each column contain a list of value:

COL1      |  COL2               |  COL3
["A","B"] |  ["V1", "V2", "V3"] |  ["V4","V5"]

I need to explode each column to present data differently and generate this dataframe, there is no data order between column :

COL1      |  COL2               |  COL3
A         |    V1               |   V4
B         |    V2               |   V5
null      |    V3               |   null

thank you for your help

possible duplicate of [this](https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe). — Quang Hoang, Dec 16 '21 at 15:37

score 2 · Answer 1 · answered Dec 16 '21 at 15:44

2

Use pd.concat:

out = pd.concat([sr.explode(ignore_index=True) for _, sr in df.iteritems()], axis=1)
print(out)

# Output:
  COL1 COL2 COL3
0    A   V1   V4
1    B   V2   V5
2  NaN   V3  NaN

answered Dec 16 '21 at 15:44

Corralien

109,409
8
28
52

score 0 · Answer 2 · answered Dec 16 '21 at 15:38

Try this:

df = df.apply(lambda row: row.apply(lambda cell: cell + [np.nan] * (max(row.apply(len)) - len(cell))), axis=1).explode(df.columns.tolist())

Output:

>>> df
  COL1 COL2 COL3
0    A   V1   V4
0    B   V2   V5
0  NaN   V3  NaN

score 0 · Answer 3 · answered Dec 16 '21 at 16:05

Use Spark inbuilt functions arrays_zip and explode to get the required output.

Example:

df.select(explode(arrays_zip(col("col1"),col("col2"),col("col3")))).\
select("col.*").\
show(10,False)

#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|a   |v1  |v4  |
#|b   |v2  |v5  |
#|null|v3  |null|
#+----+----+----+

explode each column of dataframe without data duplication

3 Answers3