I have the following pandas DataFrame:

import pandas as pd
import numpy as np    

pd_df = pd.DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
              'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', np.nan, 'banana', 'banana', 'banana'],
              'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

I'd like to apply where() to only two columns, Qu1 and Qu2, and keep the rest as they are (see the original Stack Overflow question), so I created pd1:

pd1 = pd_df.where(pd_df.apply(lambda x: x.map(x.value_counts()))>=2,
                              "other")[['Qu1', 'Qu2']]

Then I added the remaining column of pd_df, pd_df['Qu3'], to pd1:

pd1['Qu3'] = pd_df['Qu3']
pd_df = []

My question is: I want to apply where() to part of the df and keep the rest of the columns as is. Could the code above be dangerous for a large dataset? Can I harm the original data this way? If so, what is the best way to do it?

Thanks a lot !

Toren

1 Answer

You could explicitly take a copy of the original df and then overwrite a selection of columns in that copy:

In [40]:
pd1 = pd_df.copy()
pd1[['Qu1', 'Qu2']] = pd1[['Qu1', 'Qu2']].where(pd_df.apply(lambda x: x.map(x.value_counts()))>=2,
                              "other")
pd1

Out[40]:
      Qu1     Qu2      Qu3
0   other   other    apple
1  potato  banana   potato
2  cheese   apple  sausage
3  banana   apple   cheese
4  cheese   apple   cheese
5  banana   other   potato
6  cheese  banana   cheese
7  potato  banana   potato
8   other  banana      egg

The difference here is that where() is applied only to the selected columns, rather than to the whole df followed by selecting the columns of interest.
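The snippet above still computes the value counts over every column before where() aligns the mask to the two selected ones. As a minimal sketch (the names `cols`, `subset` and `counts` are mine, not from the question), the mask can also be built from the selected columns only, which may matter on a wide or large df:

```python
import numpy as np
import pandas as pd

pd_df = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', np.nan, 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

cols = ['Qu1', 'Qu2']   # hypothetical name: the columns to clean up
subset = pd_df[cols]

# Per-column frequency of each value; NaN maps to NaN, so it fails the >= 2 test
counts = subset.apply(lambda s: s.map(s.value_counts()))

pd1 = pd_df.copy()                              # original stays untouched
pd1[cols] = subset.where(counts >= 2, "other")  # Qu3 is never read or modified
print(pd1)
```

This should give the same frame as Out[40], with the counting work limited to Qu1 and Qu2.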

Update

If you just want to overwrite those columns in the original df, select only those columns and assign to them directly:

In [48]:
pd_df[['Qu1', 'Qu2']] = pd_df[['Qu1', 'Qu2']].where(pd_df.apply(lambda x: x.map(x.value_counts()))>=2,
                              "other")
pd_df

Out[48]:
      Qu1     Qu2      Qu3
0   other   other    apple
1  potato  banana   potato
2  cheese   apple  sausage
3  banana   apple   cheese
4  cheese   apple   cheese
5  banana   other   potato
6  cheese  banana   cheese
7  potato  banana   potato
8   other  banana      egg
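For a large df it can be reassuring to check that the in-place overwrite leaves the other columns alone, and to get a rough idea of what a full copy() would cost. A small sketch, assuming the same example data; `qu3_before`, `full_bytes` and `subset_bytes` are illustrative names, and memory_usage(deep=True) only gives an estimate:

```python
import numpy as np
import pandas as pd

pd_df = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', np.nan, 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

cols = ['Qu1', 'Qu2']
qu3_before = pd_df['Qu3'].copy()   # kept only to verify Qu3 afterwards

# Rough footprint: the whole df (what a default copy() duplicates) vs. the two columns
full_bytes = pd_df.memory_usage(deep=True).sum()
subset_bytes = pd_df[cols].memory_usage(deep=True).sum()

# Overwrite only Qu1 and Qu2 in the original df
counts = pd_df[cols].apply(lambda s: s.map(s.value_counts()))
pd_df[cols] = pd_df[cols].where(counts >= 2, "other")

untouched = pd_df['Qu3'].equals(qu3_before)
print(f"full df: {full_bytes} bytes, selected columns: {subset_bytes} bytes")
print(f"Qu3 unchanged: {untouched}")
```

The temporary objects here (the selection, the counts and the where() result) are each about the size of the selected columns, not of the whole df.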
EdChum
  • Thanks! My data set is approximately 30 GB. Would `copy` produce another 30 GB data set in memory? – Toren May 19 '16 at 08:45
  • Your question showed you creating a copy of the two columns. If you just want to overwrite those columns in the original, you can remove the `copy` line and run the second line on the original df. – EdChum May 19 '16 at 08:47