How can I combine pandas' .explode() with a .split() on multiple columns with a single additional row

Question

I want to rearrange a pandas DataFrame so that, whenever two (or more) columns contain a delimiter, extra rows are added for the delimited values. Each extra row should be identical to the original except in the delimited columns, where it should hold the matching elements of both columns (the second row gets the second element of each column, and so on). The following code works for a single column and illustrates my goal nicely:

df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1}, {'var1': 'd,e,f', 'var2': 2}])

df.assign(var1=df.var1.str.split(',')).explode('var1').reset_index(drop=True)

However, when I split and explode two columns, each explode multiplies the rows independently, so every original row yields all combinations of the two columns' elements, as the following code shows:

df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1, 'var3': 'I, II, III'}, {'var1': 'd,e,f', 'var2': 2, 'var3': 'IV, V, VI'}])

df.assign(var1=df.var1.str.split(','), var3=df.var3.str.split(',')).explode('var1').explode('var3').reset_index(drop=True)
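To make the problem concrete, here is a small sketch that counts the rows produced by chaining the two explodes. It assumes the intended columns are var1 and var3 (var2 holds integers), splits var3 on a comma followed by optional whitespace, and resets the index between the explodes so the second one runs on a unique index. The names split and crossed are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    {'var1': 'a,b,c', 'var2': 1, 'var3': 'I, II, III'},
    {'var1': 'd,e,f', 'var2': 2, 'var3': 'IV, V, VI'},
])

# Turn both delimited columns into lists (the regex also drops the
# spaces that follow the commas in var3).
split = df.assign(var1=df['var1'].str.split(','),
                  var3=df['var3'].str.split(r',\s*'))

# Chained explodes: each var1 element is paired with every var3 element
# of the same original row.
crossed = (split.explode('var1').reset_index(drop=True)
                .explode('var3').reset_index(drop=True))

print(len(df), len(crossed))   # 2 input rows become 18 (3 x 3 per row)
print(crossed.head(4))
```

Only 3 of the 9 rows per input row are the pairs I actually want.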

Instead, I would like one row per position, pairing the matching elements of both columns, like so:

df = pd.DataFrame([{'var1': 'a', 'var2': 1, 'var3': 'I'}, {'var1': 'b', 'var2': 1, 'var3': 'II'}, {'var1': 'c', 'var2': 1, 'var3': 'III'}, {'var1': 'd', 'var2': 2, 'var3': 'IV'}, {'var1': 'e', 'var2': 2, 'var3': 'V'}, {'var1': 'f', 'var2': 2, 'var3': 'VI'}])

I know that splitting and exploding on both columns and then subsetting the resultant dataframe would allow me to obtain the result I want, but I was looking for a potentially cleaner way to do this.
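For reference, the explode-then-subset route could look roughly like the sketch below. The helper columns i and j (and the names step1, step2, keep) are made up for illustration: they record each element's position within its list so that only the rows with matching positions are kept.

```python
import pandas as pd

df = pd.DataFrame([
    {'var1': 'a,b,c', 'var2': 1, 'var3': 'I, II, III'},
    {'var1': 'd,e,f', 'var2': 2, 'var3': 'IV, V, VI'},
])

split = df.assign(var1=df['var1'].str.split(','),
                  var3=df['var3'].str.split(r',\s*'))

# Explode var1 and record each element's position within its original row.
step1 = split.explode('var1').reset_index()   # 'index' = original row label
step1['i'] = step1.groupby('index').cumcount()

# Explode var3 and record each element's position within its var3 list.
step2 = step1.explode('var3')
step2['j'] = step2.groupby(level=0).cumcount().to_numpy()

# Keep only the rows where both positions agree, then drop the helpers.
keep = (step2['i'] == step2['j']).to_numpy()
result = (step2[keep]
          .drop(columns=['index', 'i', 'j'])
          .reset_index(drop=True))

print(result)
```

It works, but it builds the full set of combinations first and needs two bookkeeping columns, which is why I am looking for something cleaner.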

Note that for each row, both delimited columns will always contain the same number of delimiters.

edit

The .explode() method is only available in pandas >= 0.25.

So the last DataFrame is your expected output? – Dani Mesejo, Oct 07 '19 at 17:37
correct! that was the dataframe I was looking for – Mark Verhagen, Oct 07 '19 at 18:02

rafaelc · Accepted Answer · 2019-10-07T18:15:18.567


In this case, if I understand correctly, it is better to explode explicitly rather than use the .explode method (this is how it was done before pandas 0.25, when .explode was introduced). Following method #2 of this thread, you can do:

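import numpy as np
import pandas as pd

# Turn the delimited strings into lists first; np.concatenate needs sequences.
df.var1 = df.var1.str.split(',')
df.var3 = df.var3.str.split(',')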

pd.DataFrame({'var1': np.concatenate(df.var1.values),
              'var2': df.var2.repeat(df.var1.str.len()), 
              'var3': np.concatenate(df.var3.values)})

  var1  var2  var3
0    a     1     I
0    b     1    II
0    c     1   III
1    d     2    IV
1    e     2     V
1    f     2    VI
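As a side note for newer setups: from pandas 1.3 onward, DataFrame.explode accepts a list of columns and explodes them in lockstep, provided the lists in each row have the same length (which the question guarantees). A minimal sketch, assuming pandas >= 1.3 and using a regex split so the spaces after the commas in var3 are dropped:

```python
import pandas as pd

df = pd.DataFrame([
    {'var1': 'a,b,c', 'var2': 1, 'var3': 'I, II, III'},
    {'var1': 'd,e,f', 'var2': 2, 'var3': 'IV, V, VI'},
])

result = (
    df.assign(var1=df['var1'].str.split(','),
              var3=df['var3'].str.split(r',\s*'))
      # Multi-column explode (pandas >= 1.3): elements are paired by position.
      .explode(['var1', 'var3'])
      .reset_index(drop=True)
)

print(result)
```

On older versions the np.concatenate approach above remains the way to go.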

edited Oct 07 '19 at 18:15

answered Oct 07 '19 at 17:43

rafaelc


This seems to throw a ValueError for me in the np.concatenate() call for 'var1': ValueError: zero-dimensional arrays cannot be concatenated. Same issue for the second np.concatenate() call. – Mark Verhagen Oct 07 '19 at 18:12
@MarkVerhagen sorry, forgot to add the splits! Updated – rafaelc Oct 07 '19 at 18:15

Thanks @rafaelc! Works fabulously. – Mark Verhagen Oct 07 '19 at 18:20
