0

I have a pandas dataframe, sectors
with every value within each field as string and all the fields except for sector_id have null values wihtin them.

sector_id    sector_code    sector_percent    sector
----------------------------------------------------
UB1274       230;455;621    20;30;50          some_sector1
AB12312      234;786;3049   45;45;10          some_sector2
WEU234I      2344;9813      70;30             some_sector3
U2J3         3498           10                some_sector4
ALK345       ;;1289;        25;50;10;5        some_sector5
YAB45        2498;456       80                some_sector6

I'm basically trying to explode each row into multiple rows. And with some help from the stackoverflow community split-cell-into-multiple-rows-in-pandas-dataframe this is how I have been trying to do this,

from itertools import chain
def chainer(s):
    return list(chain.from_iterable(s.str.split(';')))

sectors['sector_code'].fillna(value='0', inplace=True)
sectors['sector'].fillna(value='unknown', inplace=True)
sectors['sector_percent'].fillna(value='100', inplace=True)

len_of_split = sectors['sector_code'].str.split(';').map(len) if isinstance(sectors['sector_code'], str) else 0

pd.DataFrame({
    'sector_id': np.repeat(sectors['sector_id'], len_of_split),
    'sector_code': chainer(sectors['sector_code']),
    'sector': np.repeat(sectors['sector'], len_of_split),
    'sector_percent': chainer(sectors['sector_percent'])
})

but as there are also NULL values in all the columns except for sector_id, I'm getting this error as,

ValueError: arrays must all be same length

Here's a sample code for creating the above dummy dataframe sectors,

sectors = pandas.DataFrame({'sector_id':['UB1274','AB12312','WEU234I','U2J3','ALK345','YAB45'], 'sector_code':['230;455;621','234;786;3049','2344;9813','3498',';;1289;','2498;456'], 'sector_percent':['20;30;50','45;45;10','70;30','10','25;50;10;5','80'], 'sector':['some_sector1','some_sector2','some_sector3','some_sector4','some_sector5','some_sector6']})

How do I handle this? Any help is appreciated. Thanks.

Aman Singh
  • 1,111
  • 3
  • 17
  • 31
  • The problem is from the sector_percent last line have only one value 80 – BENY Jul 22 '19 at 14:19
  • Also the third from the bottom has the only value ```30```. –  Jul 22 '19 at 14:30
  • I'm not sure if that's what you're looking for, but you could explode rows using [`pd.explode`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html#pandas.DataFrame.explode). `df = sectors.copy()`; `df1 = df.assign(sector_code=df.sector_code.str.split(';')).explode('sector_code')`; `res = df1.assign(sector_percent=df.sector_percent.str.split(';')).explode('sector_percent')` – help-ukraine-now Jul 22 '19 at 14:39
  • @political scientist wow i didn't know that there's a method in pandas for exploding. I'll try using the pandas explode method & let you know if that works – Aman Singh Jul 22 '19 at 16:51
  • @AmanSingh it worked for me but left many empty values, so I wasn’t sure if it was the solution (hence didn’t post it as an answer). Very helpful method nevertheless – help-ukraine-now Jul 22 '19 at 17:05
  • @politicalscientist if I explode one column of dataframe first and then the other column, I'm basically adding in more duplicacy, which is not how I want it. Imagine a records where there are three values in each of the two columns i.e. `sector_code` and `sector_percent`. So if split by sector_code first & then by sector_percent, I'm basically adding 9 rows in place of original 1 row when it should have just been repeated to 3 rows. – Aman Singh Jul 23 '19 at 06:26

0 Answers0