1

I have a DataFrame as below with all three columns having strings:

Column A Column B Column C
Apple red, yellow, blue Texas, California
Banana yellow, orange Indiana, New Zealand
Watermelon grey

I would like to split each row into multiple rows based on a delimiter value. The final output would look like below.

Column A Column B Column C
Apple red Texas
Apple yellow California
Apple blue
Banana yellow Indiana
Banana orange New Zealand
Watermelon grey

Note:

  1. Delimiters would only be present in Column B and Column C.
  2. The splits need not be equal. For instance, In row 1, the max splits across Column B and Column C are 3. But since column C has only 2 splits, the third row would result in column C as null/blank.

I tried following this stack overflow post, but couldn't apply it in my case here since each column would have different values based on splits across multiple columns, and since I am dealing with string splits.

Krishna
  • 115
  • 6
  • May I know why don't you respond to the below answers posted by users after so much time and effort invested by them? – SomeDude Mar 16 '23 at 15:53

3 Answers3

1

You can split your strings, then pad the lists in each row so that they all have the same length and finally apply pd.Series.explode:

df = df.fillna('').set_index('Column A')
df = df.apply(lambda x: x.str.split(r'\s*,\s*'))

def pad_lists(row):
    max_len = max(len(x) for x in row)
    return row.apply(lambda x: x + ['']*(max_len-len(x)))

print(df.apply(pad_lists, axis=1).apply(pd.Series.explode).reset_index())

Output:

     Column A Column B     Column C
0       Apple      red        Texas
1       Apple   yellow   California
2       Apple     blue             
3      Banana   yellow      Indiana
4      Banana   orange  New Zealand
5  Watermelon     grey             

Edit: setting 'Column A' as index (optionally resetting at the end) to make code less verbose and removing column names (B,C) from code for better generalization

Tranbi
  • 11,407
  • 6
  • 16
  • 33
1

try this:

from itertools import zip_longest


tmp = df.set_index('Column A').apply(lambda x: x.str.split(','))
result = (tmp.where(tmp.notnull(), '')
          .apply(lambda x: list(zip_longest(*x)), axis=1)
          .explode()
          .apply(pd.Series)
          .set_axis(tmp.columns, axis=1)
          )
print(result)
>>>
            Column B    Column C
Column A        
Apple       red         Texas
Apple       yellow      California
Apple       blue        None
Banana      yellow      Indiana
Banana      orange      New Zealand
Watermelon  grey        None
ziying35
  • 1,190
  • 3
  • 6
  • 1
    Nice use of `zip_longest`! I think you can replace `where(...)` with `fillna('')`. Also `zip_longest` will accept `fillvalue=''` as kwarg. – Tranbi Mar 08 '23 at 06:12
  • @Tranbi, If I set fillvalue = '' in zip_longest, it will prompt the error: TypeError: 'float' object is not iterable – ziying35 Mar 08 '23 at 06:43
  • That's... strange. For me your code works with the following lambda function: `lambda x: list(zip_longest(*x, fillvalue=''))`. I use python 3.10.9 and pandas 1.5.3 – Tranbi Mar 08 '23 at 06:53
  • I use python 3.8.10 and pandas 1.5.1 – ziying35 Mar 08 '23 at 07:15
  • Have you checked for typos? Cause there should be any float object in there... – Tranbi Mar 08 '23 at 09:21
-1

Option 1

We could let pandas concat fill the lengths and then reshape to get the columns with lists filled equally and then explode.

list_cols = ["Column B", "Column C"]
df[list_cols] = (
    pd.concat([df[col].str.split("\s*,\s*", expand=True) for col in list_cols])
    .fillna("")
    .agg(list, axis=1)
    .values.reshape(len(df), len(list_cols), order="F")
)
df = df.explode(list_cols, ignore_index=True)
df = df[df[list_cols].ne("").any(axis=1)]
print(df)

     Column A Column B     Column C
0       Apple      red        Texas
1       Apple   yellow   California
2       Apple     blue             
3      Banana   yellow      Indiana
4      Banana   orange  New Zealand
6  WaterMelon     grey

Option 2

Make columns B, C equal in length by padding with nan and then explode.

df["Column B"] = df["Column B"].str.split("\s*,\s*")
df["Column C"] = df["Column C"].str.split("\s*,\s*")
b_len = df["Column B"].str.len()
c_len = df["Column C"].str.len()

to_repeat = list(map(max, zip(b_len, c_len)))

df["Column C"] = df["Column C"] + pd.Series([[np.nan]*r for r in to_repeat - c_len])
df["Column B"] = df["Column B"] + pd.Series([[np.nan]*r for r in to_repeat - b_len])

df = df.explode(["Column B", "Column C"], ignore_index=True).fillna("")
print(df)

     Column A Column B     Column C
0       Apple      red        Texas
1       Apple   yellow   California
2       Apple     blue             
3      Banana   yellow      Indiana
4      Banana   orange  New Zealand
5  WaterMelon     grey 
SomeDude
  • 13,876
  • 5
  • 21
  • 44