I have a pandas dataframe, sectors
with every value within each field as string and all the fields except for sector_id
have null
values wihtin them.
sector_id sector_code sector_percent sector
----------------------------------------------------
UB1274 230;455;621 20;30;50 some_sector1
AB12312 234;786;3049 45;45;10 some_sector2
WEU234I 2344;9813 70;30 some_sector3
U2J3 3498 10 some_sector4
ALK345 ;;1289; 25;50;10;5 some_sector5
YAB45 2498;456 80 some_sector6
I'm basically trying to explode each row into multiple rows. And with some help from the stackoverflow community split-cell-into-multiple-rows-in-pandas-dataframe this is how I have been trying to do this,
from itertools import chain
def chainer(s):
return list(chain.from_iterable(s.str.split(';')))
sectors['sector_code'].fillna(value='0', inplace=True)
sectors['sector'].fillna(value='unknown', inplace=True)
sectors['sector_percent'].fillna(value='100', inplace=True)
len_of_split = sectors['sector_code'].str.split(';').map(len) if isinstance(sectors['sector_code'], str) else 0
pd.DataFrame({
'sector_id': np.repeat(sectors['sector_id'], len_of_split),
'sector_code': chainer(sectors['sector_code']),
'sector': np.repeat(sectors['sector'], len_of_split),
'sector_percent': chainer(sectors['sector_percent'])
})
but as there are also NULL
values in all the columns except for sector_id
, I'm getting this error as,
ValueError: arrays must all be same length
Here's a sample code for creating the above dummy dataframe sectors
,
sectors = pandas.DataFrame({'sector_id':['UB1274','AB12312','WEU234I','U2J3','ALK345','YAB45'], 'sector_code':['230;455;621','234;786;3049','2344;9813','3498',';;1289;','2498;456'], 'sector_percent':['20;30;50','45;45;10','70;30','10','25;50;10;5','80'], 'sector':['some_sector1','some_sector2','some_sector3','some_sector4','some_sector5','some_sector6']})
How do I handle this? Any help is appreciated. Thanks.