2

I have a dataframe that looks like this:

fruit   cost
apples  2
oranges 3
bananas 5
grapefruit  1

I want to pass a list that includes specified "fruit" column values and use that to duplicate those rows in the dataframe. For example, duplicated fruits = ['apples', 'oranges'].

These rows should then be copied back into the dataframe with an extra column that denotes that they are a copy (can be a binary 1/0).

Erfan
  • 40,971
  • 8
  • 66
  • 78
dataelephant
  • 563
  • 2
  • 7
  • 21
  • What is your expected output? – Erfan Nov 01 '19 at 13:10
  • Do you want `df['indicator'] = df['fruit'].isin(fruits)`? – Erfan Nov 01 '19 at 13:11
  • I only want to duplicate "oranges" and "apples," so values I have specified. The desired output is to duplicate these rows in the dataframe and create a new column marking which rows are originals and which are copies. – dataelephant Nov 01 '19 at 13:12

2 Answers2

3

Use Series.isin for get matching rows and for duplication use DataFrame.append to original data with DataFrame.assign for indicato column:

duplicated  = ['apples', 'oranges']
df1 = df[df['fruit'].isin(duplicated)].assign(new=1)
df = df.assign(new=0).append(df1, ignore_index=True)
print (df)
        fruit  cost  new
0      apples     2    0
1     oranges     3    0
2     bananas     5    0
3  grapefruit     1    0
4      apples     2    1
5     oranges     3    1

Another idea is use parameter keys in concat - it create new level filled by 0 and 1, so necessary DataFrame.reset_index by first level for convert this level for column:

df = (pd.concat([df, df1], keys=(0,1))
       .rename_axis(('new', None))
       .reset_index(level=0)
       .reset_index(drop=True))
print (df)
   new       fruit  cost
0    0      apples     2
1    0     oranges     3
2    0     bananas     5
3    0  grapefruit     1
4    1      apples     2
5    1     oranges     3
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks! Follow up q - if I want to specify that row now, i.e. where apples is duplicated (fruit = apples, new = 1) and I want to replace "apples" with "apple juice," how could I do that replacement? – dataelephant Nov 01 '19 at 14:50
  • @dataelephant use `m =(df['fruit'] =='apples') & (df['new'] ==1)` and then `df.loc[m, 'fruit'] = "apple juice"` – jezrael Nov 01 '19 at 16:23
  • @dataelephant also check [this](https://stackoverflow.com/a/36910033/2901002). – jezrael Nov 01 '19 at 16:25
0

You could use concat:

result = pd.concat([df, df[df.fruit.isin(fruits)].assign(new=1)], sort=False).fillna(0)

Output

        fruit  cost  new
0      apples     2  0.0
1     oranges     3  0.0
2     bananas     5  0.0
3  grapefruit     1  0.0
0      apples     2  1.0
1     oranges     3  1.0

As an alternative you could reindex, with default_value=0, before concat:

filtered = df[df.fruit.isin(fruits)].assign(new=1)

result = pd.concat([df.reindex(columns=filtered.columns, fill_value=0), filtered], sort=False)

print(result)

Output

        fruit  cost  new
0      apples     2    0
1     oranges     3    0
2     bananas     5    0
3  grapefruit     1    0
0      apples     2    1
1     oranges     3    1
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76