
I have a df with 30 million rows of the form:

0;401
0;924
0;925
1;145
1;414
1;673
2;144
2;145
2;153

I need to extract the rows where the value in the first column is repeated many times (e.g. at least 100 times). I tried a crude method:

df1 = pd.DataFrame()
state_last = None
for index, row in df.iterrows():
    if row.loc['S1'] != state_last:  # skip iterations for a value of S1 that was already handled
        state_last = row.loc['S1']
        temp = df.loc[df['S1'] == row['S1']]
        if temp.shape[0] > 100:
            df1 = df1.append(temp)

I also tried:

for i in range(19709):  # 19709 is the maximum value in the first column
    temp = df.loc[df['S1'] == i]
    if temp.shape[0] > 100:
        df1 = df1.append(temp)

But these methods are far too slow. Can this be done more quickly? Thanks in advance.

1 Answer


Assuming your columns are named first_column and second_column,

you can do:

df = df.loc[df['first_column'].duplicated(keep=False), :]

You can read more here.

EDIT:

You can do a groupby on first_column, count the number of rows per group, and then use loc to keep only the groups with a count > 100.

Check this answer from Pedro M Duarte.
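
For reference, a minimal sketch of that groupby approach (the tiny frame and the threshold N below are placeholders for illustration; in the question's data the column is S1 and the threshold would be 100):

import pandas as pd

# Toy frame standing in for the real 30-million-row df.
df = pd.DataFrame({'first_column': [0, 0, 0, 1, 1, 2],
                   'second_column': [401, 924, 925, 145, 414, 673]})

# Broadcast each group's row count back onto every row, so the boolean
# mask lines up with df's index and no Python-level loop is needed.
counts = df.groupby('first_column')['first_column'].transform('size')

# Keep only the rows whose first_column value occurs more than N times.
N = 2  # use 100 for the question's data
df1 = df.loc[counts > N]

An equivalent option is df['first_column'].value_counts() followed by isin on the values that meet the threshold; either way you avoid calling append inside a loop.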

  • you don't need the `, :` ;) – mozway Aug 14 '21 at 08:20
  • This will allow me to delete rows that have no duplicates in the first column (I don't have any at all, I'm sure). I need to delete rows that have too few duplicates in the first column. For example, if the number of rows with "1" in the first column is less than 100, then I do not need any of the rows where the first column is "1". – Александр Aug 14 '21 at 09:11