1

I have a humongous dataframe with multiple types of columns - string, boolean, integer, float (this is important, because it means that I cant use np.repeat for this problem. Which is why I'm asking my own because I believe similar solutions on here dont work for me. Either that, or I dont know how!). Well, one of my columns is an ID number and for some reason some rows have multiple numbers listed under ID. Something like this:

i  ID            Name        Boolean1  Boolean2   etc
0  2755          Blahblah1   True      False      ...
1  2894, 4755    PainInMy2   True      True       ...
2  331           Blehblue    False     False      ...

I wanna split this painful row in a way that each ID number is on a Separate row and All other values get duplicated Under it. i.e.

i  ID      Name        Boolean1  Boolean2   etc
0  2755    Blahblah1   True      False      ...
1  2894    PainInMy2   True      True       ...
2  4755    PainInMy2   True      True       ...
3  331     Blehblue    False     False      ...

What is an elegant way I can achieve this? Keep in mind this is a huge Pandas df with hundreds of thousands of rows and a dozen columns of DIFFerent types; and I would like to keep most, if not all, of pandas df metadata for it. I can butcher it with series of fors and ifs, but I feel like there should be easier, possibly just a couple lines, way to do this. Maybe with split(',') or something similar. But I can't figure out how. Thanks!

(Please dont mark this as duplicate. I couldnt find any answer that worked for me!)

A. K.
  • 25
  • 8
  • 1
    Which duplicated did you try? I can see many. Try https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows? – Vaishali Oct 30 '18 at 22:36
  • Try df.set_index(['Name', 'Boolean1', 'Boolean2']).ID.str.split(',', expand=True).stack().reset_index(3,drop = True).reset_index().rename(columns = {0 : 'ID'}) – Vaishali Oct 30 '18 at 22:41
  • @Vaishali this was amazing! worked beautifully; and you can do it period to period to literally see every step. Loved it! Thanks! Edit: by "this" I meant your one-liner; the other solution took 6-7 times longer – A. K. Oct 31 '18 at 15:27
  • Tha'ts great. I guess I will post it as an answer given that you couldn't work with the dupe – Vaishali Oct 31 '18 at 15:43

1 Answers1

0

You can use pandas str methods to split the cell on the given separator and stack to expand into multiple rows. Set rest of the columns as index to ensure that the data corresponds to the correct value.

df.set_index(['Name', 'Boolean1', 'Boolean2']).ID.str.split(',', expand=True)\
.stack().reset_index(3,drop = True).reset_index()\
.rename(columns = {0 : 'ID'})

You get

    Name       Boolean1 Boolean2  ID
0   Blahblah1   True    False     2755
1   PainInMy2   True    True      2894
2   PainInMy2   True    True      4755
3   Blehblue    False   False     331
Vaishali
  • 37,545
  • 5
  • 58
  • 86
  • Hey, I found a slightly more intuitive way, with using only one `reset_index` . You can do one full reset and drop the extra column. So basically, after `stack()` it becomes `reset_index().drop("level_3", axis=1).rename(columns = {0 : 'ID'})` – A. K. Nov 01 '18 at 18:07
  • Yes, in principal it involves two steps: either reset index with drop parameter which returns a multi index and hence another reset index or reset index with drop outside of it. Both are valid approaches – Vaishali Nov 01 '18 at 18:12