Is there any way to perform this operation faster?

Question

I am dealing with a large medical dataset, and I want to check whether some rows correspond to the same patient. The column that holds the patient's ID is ID Patient.
I want to create a new column that is "Yes" if the patient appears in more than one row, or "No" if the patient appears only once.
This is the code I wrote:

df['Repeated'] = 'No' # New Column

for i in range(0,len(df)):
    for f in range(0,len(df)):
        if df['ID NP'].iloc[i] == df['ID NP'].iloc[f]:
            df['ID NP'].iloc[i] = 'Yes'
        else:
            df['ID NP'].iloc[i] = 'No'

However, this operation takes too much time. Is there any way to do it faster?

You could use [duplicated](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.duplicated.html). — hilberts_drinking_problem, Mar 19 '20 at 20:04

score 0 · Accepted Answer · answered Mar 19 '20 at 20:10

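Not only does it take too much time, it also produces wrong results: the loop assigns "Yes"/"No" to the 'ID NP' column itself, overwriting the patient IDs, instead of to 'Repeated'. Each row is also compared with itself (when i == f).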

You only need to count how many times each ID occurs. So count them:

stats = {}
for patient_id in df['ID NP']:  # iterate over the column values directly
    stats[patient_id] = stats.get(patient_id, 0) + 1
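
As a side note, the counting step can probably be done without an explicit loop. The sketch below assumes pandas is imported as `pd` and uses a small made-up DataFrame (the IDs are hypothetical, not from the question's data); `Series.value_counts()` returns how often each distinct value occurs.

```python
import pandas as pd

# Hypothetical sample data standing in for the real medical dataset.
df = pd.DataFrame({"ID NP": [101, 102, 101, 103, 102, 101]})

# value_counts() maps each distinct ID to the number of rows it appears in.
counts = df["ID NP"].value_counts()

# A plain dict can then play the role of the `stats` dict built by the loop.
stats = counts.to_dict()
print(stats)
```

The resulting `stats` can be used in the same way as the dictionary built by hand.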

Now you only need to iterate over the row positions once:

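repeated_col = df.columns.get_loc('Repeated')  # position of the 'Repeated' column
for i in range(len(df)):
    id_np = df['ID NP'].iloc[i]
    if stats[id_np] > 1:
        # this ID occurs more than once, so mark the row
        df.iloc[i, repeated_col] = 'Yes'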
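
For comparison, here is a minimal sketch of the `duplicated` approach suggested in the comment above, which avoids Python-level loops entirely. It assumes pandas is imported as `pd`; the sample DataFrame is made up for illustration. With `keep=False`, `Series.duplicated` flags every occurrence of a repeated value, not only the later ones.

```python
import pandas as pd

# Hypothetical sample data standing in for the real medical dataset.
df = pd.DataFrame({"ID NP": [101, 102, 101, 103, 102, 101]})

# keep=False marks every row whose ID appears more than once as True.
is_repeated = df["ID NP"].duplicated(keep=False)

# Translate the boolean mask into the "Yes"/"No" labels the question asks for.
df["Repeated"] = is_repeated.map({True: "Yes", False: "No"})
print(df)
```

On a large dataset this should be considerably faster than the nested loops, since the work happens inside pandas rather than in the interpreter.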
