How to repeat rows based on value of a column in Python

Question

I have the following df that comprises of code/product & weeks columns.

code.  Product  .   weeks
123 .  product1 .    1;2
123 .  product1 .    3
321 .  product2 .    4;5;6
321 .  product2 .    7

For those rows that have more than 1 week (eg 1;2 or 4;5;6), I want to repeat these rows. My Desirsed output is as follows:

code.  Product  .   weeks
123 .  product1 .    1
123 .  product1 .    2
123 .  product1 .    3
321 .  product2 .    4
321 .  product2 .    5
321 .  product2 .    6
321 .  product2 .    7

What is the best approach to take using pandas or numpy?

Possible duplicate of [How to explode a list inside a Dataframe cell into separate rows](https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows) — Matthias, Jul 23 '18 at 07:37

jezrael · Accepted Answer · 2018-07-23T07:51:58.503

2

Use:

df = (df.set_index(['code','Product'])['weeks']
       .str.split(';', expand=True)
       .stack()
       .reset_index(level=2, drop=True)
       .reset_index(name='weeks'))
print (df)
   code   Product weeks
0   123  product1     1
1   123  product1     2
2   123  product1     3
3   321  product2     4
4   321  product2     5
5   321  product2     6
6   321  product2     7

Explanation:

First set_index by all repeated columns
Create DataFrame by split
Reshape by stack
Last data cleaning by reset_index

Another solution:

from itertools import chain

weeks = df['weeks'].str.split(';')
lens = weeks.str.len()
df = pd.DataFrame({
    'code' : df['code'].repeat(lens),
    'Product' : df['Product'].repeat(lens),
    'weeks' : list(chain.from_iterable(weeks.values.tolist())), 
})

print (df)
   code   Product weeks
0   123  product1     1
0   123  product1     2
1   123  product1     3
2   321  product2     4
2   321  product2     5
2   321  product2     6
3   321  product2     7

Explanation:

Create lists by split
Get lengths of lsits by len
Last repeat columns and flatten weeks

edited Jul 23 '18 at 07:51

answered Jul 23 '18 at 07:39

jezrael

822,522
95
1,334
1,252

The first solution appears to be working correctly except for the fact that in my df, it excludes rows that don't have (' ; ') in the ['weeks'] column value. To relate back to the original df above, data for weeks 3 and week 7 are not appearing. I think it has something to do with the str.split function accidentally excluding the additional data. Your help in finding the last missing piece of code will be most helpful! – jwlon81 Jul 24 '18 at 00:47
@jwlon81 - So first solution failed and second working? If data are not confidental, is possible send me csv to my email in my profile for test it? – jezrael Jul 24 '18 at 05:12
1

Hi Jezrael, I'm not sure what happened but I ran the code again and I now works! Thanks – jwlon81 Jul 24 '18 at 05:47

score 0 · Answer 2 · answered Jul 23 '18 at 07:40

#assume test.xlsx is your data
test = pd.read_excel('test.xlsx')  
test_processed = pd.DataFrame(columns=test.columns)
for index, row in test.iterrows():
   weeks = row['weeks'].split(';')
   for week in weeks:
       test_processed = test_processed.append({'code':row['code'], 'Product':row['Product'],'weeks':week}, ignore_index=True)

How to repeat rows based on value of a column in Python

2 Answers2