1

I need to create a python code to search "N" as variable, consecutive rows in a column dataframe with the same value and different that NaN like this. I can't figure out how to do it with a for loop because I don't know which row I'm looking at in each case. Any idea that how can do it?

Fruit 2 matches 5 matches
Apple No No
NaN No No
Pear No No
Pear Yes No
Pear Yes No
Pear Yes No
Pear Yes Yes
NaN No No
NaN No No
NaN No No
NaN No No
NaN No No
Banana No No
Banana Yes No

Update: testing solutions by @Corralien

counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
            .transform('cumcount').add(1)  # cumulative counter
            .where(df['Fruit'].notna(), other=0))  # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')

VSCode return me the 'Frame skipped from debugging during step-in.' message when execute the last line and generate an exception in the previous for loop.

Optimvs
  • 15
  • 4

1 Answers1

4

Compute consecutive values and set NaN to 0. Once you have calculated the cumulative counter, you just have to check if the counter is greater than or equal to N:

counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
            .transform('cumcount').add(1)  # cumulative counter
            .where(df['Fruit'].notna(), other=0))  # set NaN to 0

N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})

N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})

Output:

>>> df
     Fruit 2 matches 5 matches
0    Apple        No        No
1      NaN        No        No
2     Pear        No        No
3     Pear       Yes        No
4     Pear       Yes        No
5     Pear       Yes        No
6     Pear       Yes       Yes
7      NaN        No        No
8      NaN        No        No
9      NaN        No        No
10     NaN        No        No
11     NaN        No        No
12  Banana        No        No
13  Banana       Yes        No

>>> counts
0     1
1     0
2     1
3     2
4     3
5     4
6     5
7     0
8     0
9     0
10    0
11    0
12    1
13    2
dtype: int64

Update

if I need to change "Yes" for the fruit name for example

N = 2
df['2 matches'] = df.where(counts >= N, other='No')
print(df)

# Output
     Fruit 2 matches
0    Apple        No
1      NaN        No
2     Pear        No
3     Pear      Pear
4     Pear      Pear
5     Pear      Pear
6     Pear      Pear
7      NaN        No
8      NaN        No
9      NaN        No
10     NaN        No
11     NaN        No
12  Banana        No
13  Banana    Banana
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • If you use `gt` you don't need the `add(1)` ;) – mozway Jan 19 '23 at 21:41
  • I have a doubt for N=1? (I know it doesn't make sense). but I can set NaN to -1 :-) – Corralien Jan 19 '23 at 21:42
  • Yes or keep it NaN and change your condition to convert to yes/no – mozway Jan 19 '23 at 21:45
  • 1
    Too much work for me at this hour :-) – Corralien Jan 19 '23 at 21:46
  • Thanks a lot! Runs perfectly!! I've a doubt, if I need to change "Yes" for the fruit name for example. Can change .replace({True: 'Yes', False: 'No'}) by .replace({True: df['ERC'], False: 'No'}) or adding new line condition? – Optimvs Jan 19 '23 at 22:00
  • Thanks for the update. It works perfectly in Colab individually, but not in my full VSC program. When running the first line "counts=" it goes to a for exception. I had a "SettingWithCopyWarning" warning before, but the application worked fine. Can this warning affect this new code? – Optimvs Jan 19 '23 at 22:37
  • No. However, take a look to https://stackoverflow.com/q/20625582/15239951 – Corralien Jan 19 '23 at 22:40
  • With your first solution now haven't any error and works fine, but when change to the update version returns the exception error again. It is posible to create this new '2 matches? DF column by .loc? I don't find how do it. – Optimvs Jan 20 '23 at 13:13
  • Please update your post with the traceback error. – Corralien Jan 20 '23 at 13:33
  • Updated with the debug console message. Thanks a lot for your time to look it! – Optimvs Jan 20 '23 at 14:23
  • Can you try this version `df['2 matches'] = np.where(counts.ge(N), df['Fruit'], 'No')` please? Don't forget to `import numpy as np` – Corralien Jan 20 '23 at 14:29
  • 1
    Now works perfect! I don't know the differences between this 2 methods, numpy should be strongest. Thanks a lot again! – Optimvs Jan 20 '23 at 14:42
  • It should have the same behavior but sometimes... :-) – Corralien Jan 20 '23 at 14:56