2

I have a pandas dataframe created from measured numbers. When something goes wrong with the measurement, the last value is repeated. I would like to do two things:
1. Change all repeating values either to nan or 0.
2. Keep the first repeating value and change all other values nan or 0.

I have found solutions using "shift" but they drop repeating values. I do not want to drop repeating values.My data frame looks like this:

df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12:15,1]=22
df.iloc[10:12,2]=0.23

giving a dataframe like this:

            0          1         2
0    1.239916   1.109434  0.305490
1    0.248682   1.472628  0.630074
2   -0.028584  -1.116208  0.074299
3   -0.784692  -0.774261 -1.117499
4   40.000000   0.283084 -1.495734
5   40.000000  -0.074763 -0.840403
6   40.000000   0.709794 -1.000048
7   40.000000   0.920943  0.681230
8   -0.701831   0.547689 -0.128996
9   -0.455691   0.610016  0.420240
10  -0.856768  -1.039719  0.230000
11   1.187208   0.964340  0.230000
12   0.116258  22.000000  1.119744
13  -0.501180  22.000000  0.558941
14   0.551586  22.000000 -0.993749

what I would like to be able to do is write some code that would filter the data and give me a data frame like this:

           0         1         2
0   1.239916  1.109434  0.305490
1   0.248682  1.472628  0.630074
2  -0.028584 -1.116208  0.074299
3  -0.784692 -0.774261 -1.117499
4        NaN  0.283084 -1.495734
5        NaN -0.074763 -0.840403
6        NaN  0.709794 -1.000048
7        NaN  0.920943  0.681230
8  -0.701831  0.547689 -0.128996
9  -0.455691  0.610016  0.420240
10 -0.856768 -1.039719       NaN
11  1.187208  0.964340       NaN
12  0.116258       NaN  1.119744
13 -0.501180       NaN  0.558941
14  0.551586       NaN -0.993749

or even better keep the first value and change the rest to NaN. Like this:

            0          1         2
0    1.239916   1.109434  0.305490
1    0.248682   1.472628  0.630074
2   -0.028584  -1.116208  0.074299
3   -0.784692  -0.774261 -1.117499
4   40.000000   0.283084 -1.495734
5         NaN  -0.074763 -0.840403
6         NaN   0.709794 -1.000048
7         NaN   0.920943  0.681230
8   -0.701831   0.547689 -0.128996
9   -0.455691   0.610016  0.420240
10  -0.856768  -1.039719  0.230000
11   1.187208   0.964340       NaN
12   0.116258  22.000000  1.119744
13  -0.501180        NaN  0.558941
14   0.551586        NaN -0.993749
piRSquared
  • 285,575
  • 57
  • 475
  • 624
getaglow
  • 343
  • 4
  • 15
  • could you share your attempt with the shift function that did not work for you. my suggestion involves the shift function, and produces output exactly as your desired output. I'm curious to see what you tried with the shift function. – Haleemur Ali Jan 24 '18 at 17:51
  • I was looking at this solution that used the "shift" command: – getaglow Jan 24 '18 at 20:05
  • https://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates – getaglow Jan 24 '18 at 20:05

2 Answers2

4

using shift & mask:

df.shift(1) == df compares the next row to the current for consecutive duplicates.

df.mask(df.shift(1) == df)

# outputs    
            0          1         2
0    0.365329   0.153527  0.143244
1    0.688364   0.495755  1.065965
2    0.354180  -0.023518  3.338483
3   -0.106851   0.296802 -0.594785
4   40.000000   0.149378  1.507316
5         NaN  -1.312952  0.225137
6         NaN  -0.242527 -1.731890
7         NaN   0.798908  0.654434
8    2.226980  -1.117809 -1.172430
9   -1.228234  -3.129854 -1.101965
10   0.393293   1.682098  0.230000
11  -0.029907  -0.502333       NaN
12   0.107994  22.000000  0.354902
13  -0.478481        NaN  0.531017
14  -1.517769        NaN  1.552974

if you want to remove all the consecutive duplicates, test that the previous row is also the same as the current row

df.mask((df.shift(1) == df) | (df.shift(-1) == df))
Haleemur Ali
  • 26,718
  • 5
  • 61
  • 85
  • I can't thank you enough!!! You are going to make me look good. Thank you. – getaglow Jan 24 '18 at 20:11
  • honestly, this is one of those answers that has so many applications that this immediately goes into my pandas toolbelt. the `mask` and `shift` combo is so simple - yet so powerful. – MattR Jan 25 '18 at 14:54
2

Option 1
Specialized solution using diff. Get's at the final desired output.

df.mask(df.diff().eq(0))

            0          1         2
0    1.239916   1.109434  0.305490
1    0.248682   1.472628  0.630074
2   -0.028584  -1.116208  0.074299
3   -0.784692  -0.774261 -1.117499
4   40.000000   0.283084 -1.495734
5         NaN  -0.074763 -0.840403
6         NaN   0.709794 -1.000048
7         NaN   0.920943  0.681230
8   -0.701831   0.547689 -0.128996
9   -0.455691   0.610016  0.420240
10  -0.856768  -1.039719  0.230000
11   1.187208   0.964340       NaN
12   0.116258  22.000000  1.119744
13  -0.501180        NaN  0.558941
14   0.551586        NaN -0.993749
piRSquared
  • 285,575
  • 57
  • 475
  • 624