
I have a df with 30 million rows of the form:

0;401
0;924
0;925
1;145
1;414
1;673
2;144
2;145
2;153

I need to extract the rows where the value in the first column is repeated many times (e.g. at least 100 times). I tried a crude method:

df1 = pd.DataFrame()
state_last = None
for index, row in df.iterrows():
    if row.loc['S1'] != state_last:  # skip iterations for a value of S1 that was already handled
        state_last = row.loc['S1']
        temp = df.loc[df['S1'] == row['S1']]
        if temp.shape[0] > 100:
            df1 = df1.append(temp)

I also tried:

for i in range(19709):  # 19709 is the maximum value in the first column
    temp = df.loc[df['S1'] == i]
    if temp.shape[0] > 100:
        df1 = df1.append(temp)

But these methods are far too slow. Can this be done more quickly? Thanks in advance.

1 Answer


Assuming your columns are named first_column and second_column,

you can do:

df = df.loc[df['first_column'].duplicated(keep=False), :]

You can read more here.

EDIT:

You can do a groupby on first_column, count the number of rows per group, and then use loc to keep only the groups with a count > 100.

Check this answer from Pedro M Duarte.
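
For reference, a minimal sketch of that groupby approach (the tiny frame and the threshold N below are placeholders for illustration; in the question's data the column is S1 and the threshold would be 100):

import pandas as pd

# Toy frame standing in for the real 30-million-row df.
df = pd.DataFrame({'first_column': [0, 0, 0, 1, 1, 2],
                   'second_column': [401, 924, 925, 145, 414, 673]})

# Broadcast each group's row count back onto every row, so the boolean
# mask lines up with df's index and no Python-level loop is needed.
counts = df.groupby('first_column')['first_column'].transform('size')

# Keep only the rows whose first_column value occurs more than N times.
N = 2  # use 100 for the question's data
df1 = df.loc[counts > N]

An equivalent option is df['first_column'].value_counts() followed by isin on the values that meet the threshold; either way you avoid calling append inside a loop.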

  • you don't need the `, :` ;) – mozway Aug 14 '21 at 08:20
  • This will allow me to delete rows that have no duplicates in the first column (I don't have any at all, I'm sure). I need to delete rows that have too few duplicates in the first column. For example, if the number of rows with "1" in the first column is less than 100, then I do not need any of the rows where the first column is "1". – Александр Aug 14 '21 at 09:11