Split rows according to text in two columns (Python, Pandas)

Question

This is my dataframe (with many more letters and a length of ~35.5k) and stuff where the – are other relevant strings). All the variables are strings and ['C1','C2'] is the MultiIndex.

tmp

C1    C2     C3    C4    C5    Start    End     C8
A     1      -      -     -    12       14      -
A     2      -      -     -    1,4,7    3,6,10  -
A     3      -      -     -    16,19    17,21   -
A     4      -      -     -    22       24      -

I need it to become this (split every row that contains commas maintaining everything else):

C1    C2     C3    C4    C5    Start  End   C8   Appearance
A     1      -      -     -    12     14    -    1
A     2      -      -     -    1      3     -    1
A     2      -      -     -    4      6     -    2
A     2      -      -     -    7      10    -    3
A     3      -      -     -    16     17    -    1
A     3      -      -     -    19     21    -    2
A     4      -      -     -    22     24    -    1

I tried this script pandas: How do I split text in a column into multiple rows?

as

s = tmp['Start'].str.split(',').apply(Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'Start
del tmp['Start']
final = tmp.join(s)

But then the result is much larger than it should! I get thousands of repeats and this is just trying to split 'Start'. I can't even imagine trying to do so for both Start and End (every comma in 'Start' implies a comma in 'End'.

Lengths:
tmp   = 35568
s     = 35676
final = 293408

Isn't this expected? If you have [1, 4, 7] in a row, you will have two additional rows in the result. — ayhan, Jul 15 '16 at 18:49

jezrael · Accepted Answer · 2016-07-15T19:38:41.547

3

You can create new df from s1 and s2 and then join. Also better is use parameter expand=True in str.split and delete multiple columns by drop:

For creating column Appearance use groupby by index with cumcount.

s1 = tmp['Start'].str.split(',', expand=True).stack()
s1.index = s1.index.droplevel(-1)
s1.name = 'Start'

s2 = tmp['End'].str.split(',', expand=True).stack()
s2.index = s2.index.droplevel(-1)
s2.name = 'End'
tmp.drop(['Start', 'End'], inplace=True, axis=1)

df = pd.DataFrame({'s1':s1, 's2':s2}, index=s1.index)
final = tmp.join(df)

final['Appearance'] = final.groupby(final.index).cumcount() + 1
print (final)
  C1  C2 C3 C4 C5 C8  s1  s2  Appearance
0  A   1  -  -  -  -  12  14           1
1  A   2  -  -  -  -   1   3           1
1  A   2  -  -  -  -   4   6           2
1  A   2  -  -  -  -   7  10           3
2  A   3  -  -  -  -  16  17           1
2  A   3  -  -  -  -  19  21           2
3  A   4  -  -  -  -  22  24           1

EDIT by comment:

You can try reset_index first:

print (tmp)
      C3 C4 C5  Start     End C8
C1 C2                           
A  1   -  -  -     12      14  -
   2   -  -  -  1,4,7  3,6,10  -
   3   -  -  -  16,19   17,21  -
   4   -  -  -     22      24  -

tmp.reset_index(inplace=True)
print (tmp)
  C1  C2 C3 C4 C5  Start     End C8
0  A   1  -  -  -     12      14  -
1  A   2  -  -  -  1,4,7  3,6,10  -
2  A   3  -  -  -  16,19   17,21  -
3  A   4  -  -  -     22      24  -

edited Jul 15 '16 at 19:38

answered Jul 15 '16 at 18:42

jezrael

822,522
95
1,334
1,252

"[293408 rows x 7 columns]" Shouldn't it have the same amount of rows as s1/s2? (~35k) – Nico Jul 15 '16 at 18:45
Is length `s1` and `s2` same? is index of `s1` same as index of `s2`? – jezrael Jul 15 '16 at 18:47
Yes! Every comma in Start implies a comma in End. – Nico Jul 15 '16 at 18:48
Maybe there are duplicates in index - try first use - `tmp.reset_index(inplace=True)` – jezrael Jul 15 '16 at 18:55
It works!!! I'm sure the other answer also works but since you were first I'll tick you. Thanks to both. Could you also help me to add the 'Appearance' column? – Nico Jul 15 '16 at 19:33
Thank you. Please give me a time, I add column. – jezrael Jul 15 '16 at 19:36
And maybe you forget tick solution. Thanks. – jezrael Jul 15 '16 at 20:12
I don't know why it was unticked, I swear it was ticked! By the way: your method for adding the Appearence column is counting each C2 item universally instead of locally (grouped by C1 items). Do you have a solution for this? – Nico Jul 18 '16 at 19:07
Ia m not sure if understand - what is difference between `universally` and `locally` ? – jezrael Jul 18 '16 at 19:17
Already fixed it! I needed to index by C1,C2,C3 instead of C1,C2 (which gives away Appearance.max() = 64 instead of 8). – Nico Jul 18 '16 at 19:34

score 2 · Answer 2 · answered Jul 15 '16 at 18:49

2

I concat the expanded 'Start' and 'End' columns to ensure they match up even if they don't have the same number of entries.

s = tmp.Start.str.split(',', expand=True).stack().rename('Start')
e = tmp.End.str.split(',', expand=True).stack().rename('End')
se = pd.concat([s, e], axis=1).reset_index(1, drop=True)

tmp.drop(['Start', 'End'], axis=1).merge(se, left_index=True, right_index=True)

answered Jul 15 '16 at 18:49

piRSquared

285,575
57
475
624

`TypeError: split() got an unexpected keyword argument 'expand'` – Nico Jul 15 '16 at 18:52
@Nico drop the argument. You are using an older pandas version. – piRSquared Jul 15 '16 at 18:53
Your answer is nice too ;) +1 – jezrael Jul 15 '16 at 18:55
`AttributeError: 'Series' object has no attribute 'stack'` – Nico Jul 15 '16 at 18:59
@Nico Sorry I can't fix that for you other than to suggest upgrading you pandas to 0.18.1 and try jezrael's solution or mine again. – piRSquared Jul 15 '16 at 19:01
Both use the same function. I'll try to update it in a few minutes and then I'll try again. – Nico Jul 15 '16 at 19:04

Split rows according to text in two columns (Python, Pandas)

2 Answers2

Related