I want to keep only the first value of each run of consecutive identical values and set the remaining values in that run to zero. If the same value appears again later in the series, its first occurrence there should be kept as well. For example, -265.95745849609375 appears both at the top and at the bottom:

    s1
    Index               value  
    847.7248427790372   -265.95745849609375
    847.7448445792772   -265.95745849609375
    847.8448535804773   -265.95745849609375
    847.8648553807175   -480.0611789817236
    847.8848571809574   -714.2857666015625
    848.0048679823976   -714.2857666015625
    848.0248697826377   -714.2857666015625
    ....                .....
    849.0449615948793   -714.2857666015625
    849.0649633951193   -550.6575933264419
    849.0849651953594   -446.4285583496094
    849.1849741965596   -446.4285583496094
    ...                 ...
    849.2449795972797   -446.4285583496094
    849.8650354047206   -248.9522315559211
    849.8850372049607   -265.95745849609375
    849.9050390052007   -265.95745849609375
    849.9250408054407   -265.95745849609375

Expected outcome:

    847.7248427790372   -265.95745849609375
    847.7448445792772   0
    847.7648463795173   0
    847.8648553807175   -480.0611789817236
    847.8848571809574   -714.2857666015625
    847.9048589811974   0
    847.9248607814375   0
    848.0248697826377   0
    ....                .....
    849.0449615948793   0
    849.0649633951193   -550.6575933264419
    849.0849651953594   -446.4285583496094
    849.1049669955994   0
    849.1249687958394   0
    849.1849741965596   0
    ...                 ...
    849.2449795972797   0
    849.8650354047206   -248.9522315559211
    849.8850372049607   -265.95745849609375
    849.9050390052007   0
    849.9250408054407   0

This is the code I used:

    for outer in range(1,len(s1['value'])-1):
        if s1['value'].values[outer] == s1['value'].values[outer+1]:
            for inner in range(outer,len(s1['value'])):
                if s1['value'].values[outer] == s1['value'].values[inner]:
                    s1['value'].values[inner] = 0
        outer=inner+1

However, this takes a long time to run, because the series normally has 30,000 or more elements. Can anyone suggest a better and faster way to do this? Thanks in advance.

1 Answer

You can use `Series.duplicated` to find duplicates and then set them to 0 using `np.where` or `Series.mask`:

    df['value'] = np.where(df['value'].duplicated(), 0, df['value'])

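This marks every duplicate as 0, including repeats that occur later in the series. However, if you want to start over when a value reappears later in the series rather than immediately after itself, compare each value with the previous one instead: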

    df['value'] = np.where(df['value'].eq(df['value'].shift()), 0, df['value'])

    print(df)

             Index       value
    0   847.724843 -265.957458
    1   847.744845    0.000000
    2   847.844854    0.000000
    3   847.864855 -480.061179
    4   847.884857 -714.285767
    5   848.004868    0.000000
    6   848.024870    0.000000
    7   849.044962    0.000000
    8   849.064963 -550.657593
    9   849.084965 -446.428558
    10  849.184974    0.000000
    11  849.244980    0.000000
    12  849.865035 -248.952232
    13  849.885037 -265.957458
    14  849.905039    0.000000
    15  849.925041    0.000000
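
To make the difference between the two approaches concrete, here is a minimal self-contained sketch. The sample values are a shortened, rounded stand-in for the question's data, and the variable names (`s`, `repeat_in_run`, `via_where`, `via_mask`, `via_duplicated`) are only illustrative. It assumes that repeats within a run are exactly equal floats, as they are in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the first value reappears at the end of the series.
s = pd.Series(
    [-265.957, -265.957, -480.061, -714.286, -714.286, -265.957, -265.957],
    name="value",
)

# True where a value equals the one directly before it,
# i.e. it is a repeat inside a run of consecutive identical values.
repeat_in_run = s.eq(s.shift())

# Option 1: np.where returns a NumPy array, so wrap it back into a Series.
via_where = pd.Series(np.where(repeat_in_run, 0, s), index=s.index, name=s.name)

# Option 2: Series.mask replaces values where the condition is True.
via_mask = s.mask(repeat_in_run, 0)

# For contrast: duplicated() flags every later occurrence of a value,
# so the reappearance of -265.957 at the end is zeroed as well.
via_duplicated = s.mask(s.duplicated(), 0)

print(via_mask.tolist())
print(via_duplicated.tolist())
```

With the shift-based condition, the first -265.957 of the final run is kept. With `duplicated()` it is zeroed, because that value was already seen at the top of the series.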
anky
  • You'll need to pass `keep=False` to the `duplicated()` function otherwise only the first value is marked. – mullinscr Feb 02 '21 at 15:47
  • @mullinscr keep=False will set all duplicates to 0 and will not retain the value of the first duplicate. OP wants the first value retained. – anky Feb 02 '21 at 15:49
  • The OP wants it captured every time it appears. You are right, though, that `keep=False` will give the inverse unless you swap the `np.where` replacements, e.g. `np.where(s.duplicated(keep=False),s, 0)` – mullinscr Feb 02 '21 at 15:52
  • @mullinscr that will set the non duplicate values to 0. OP wants only duplicates to be marked as 0. Albeit I also think they want to start over again when a duplicate occurs later in the series i.e the duplicate is not consecutive. I have addressed that now. Thank you :) – anky Feb 02 '21 at 15:57
  • Thank you very much for the answers. I am using the np.where function. It works like a charm. – Sridhar Eswaran Feb 03 '21 at 09:26
  • @SridharEswaran Glad it helped. Avoid loops when dealing with numbers or in general when using pandas wherever possible. :) They are slow. Vectorized methods are fast. – anky Feb 03 '21 at 09:36
  • @anky Thank you very much for your advice. I will keep that in mind in future :-) – Sridhar Eswaran Feb 04 '21 at 10:02