I have a CSV file that looks like this:

screen_name,tweet,following,followers,is_retweet,bot
narutouz16,Grad school is lonely.,59,20,0,0
narutouz16,RT @GetMadz: Sound design in this game is 10/10 game freak lied. ,59,20,1,0
narutouz16,@hbthen3rd I know I don't.,59,20,0,0
narutouz16,"@TonyKelly95 I'm still not satisfied in the ending, even though its longer.",59,20,0,0
narutouz16,I'm currently in second place in my leaderboards in duolongo.,59,20,0,0

I am able to read this into a dataframe using the following:

import pandas as pd
df = pd.read_csv("file.csv")

That works great. When I print(df.shape), I get the following dimensions: (1223726, 6)

I have a list of usernames, like below:

bad_names = ['BELOZEROVNIKIT', 'ALTMANBELINDA', '666STEVEROGERS', 'ALVA_MC_GHEE', 'CALIFRONIAREP', 'BECCYWILL', 'BOGDANOVAO2', 'ADELE_BROCK', 'ANN1EMCCONNELL', 'ARONHOLDEN8', 'BISHOLORINE', 'BLACKTIVISTSUS', 'ANGELITHSS', 'ANWARJAMIL22', 'BREMENBOTE', 'BEN_SAR_GENT', 'ASSUNCAOWALLAS', 'AHMADRADJAB', 'AN_N_GASTON', 'BLACK_ELEVATION', 'BERT_HENLEY', 'BLACKERTHEBERR5', 'ARTHCLAUDIA', 'ALBERTA_HAYNESS', 'ADRIANAMFTTT']

What I want to do is loop through the dataframe, and if the username is in this list at all, to remove those rows from df and add them to a new df called bad_names_df.

Pseudocode would look like:

for each row in df:
    if row.username in bad_names:
        bad_names_df.append(row)
        df.remove(row)
    else:
        continue

My attempt:

for row, col in df.iterrows():
    if row['username'] in bad_user_names:
        new_df.append(row)
    else:
        continue

How is it possible to (efficiently) loop through df, with over 1.2M rows, and if the username is in the bad_names list, remove that row and add that row to a bad_names_df? I have not found any other SO posts that address this issue.

artemis
  • Possible duplicate of [Delete rows from a pandas DataFrame based on a conditional expression involving len(string) giving KeyError](https://stackoverflow.com/questions/13851535/delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression-involving) – AMC Nov 22 '19 at 03:05
  • The general operation is the same, which is removing rows based on a boolean condition. There are some good answers there on how to do just that. – AMC Nov 22 '19 at 03:07
  • But it's _also_ appending rows. And the condition is different. That, by definition, necessitates a different post. I suggest you re-read the definition of a duplicate before flagging a post, which clearly is not a duplicate, as a duplicate. – artemis Nov 22 '19 at 03:08
  • Isn't the objective to separate/filter the rows? Is appending necessary? – AMC Nov 22 '19 at 03:09

2 Answers

You can apply a lambda then filter as follows:

# The sample CSV names this column screen_name, not username
df['keep'] = df['screen_name'].apply(lambda x: x not in bad_names)
df = df[df['keep']]
Yaakov Bressler
  • So this creates a new column, `keep`, and fills it with a `True` or `False` value. Then, I recreate that `df` based on if that value is `True`. Is there a more memory efficient way with more than ~1M rows? (It's fine if not :)), additionally, where am I creating the second df with the bad names? `bad_names_df = df[df['keep']==False]`? – artemis Nov 22 '19 at 01:53
  • Sure: `df[~df['username'].apply(lambda x: False if x in bad_names else True)]` would work too. – Yaakov Bressler Nov 22 '19 at 03:29
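
On the memory question raised in the comments above, one option is to skip the helper column and keep the boolean result in its own variable, so nothing extra is stored on `df`. The following is a minimal sketch, not the answerer's code: the tiny DataFrame, the `bad_set` and `is_bad` names, and the sample values are all made up for illustration, and the column is assumed to be `screen_name` as in the CSV header.

```python
import pandas as pd

# Tiny stand-in for the real 1.2M-row file (values are illustrative only).
df = pd.DataFrame({
    "screen_name": ["narutouz16", "BECCYWILL", "narutouz16", "ADELE_BROCK"],
    "tweet": ["a", "b", "c", "d"],
})
bad_names = ["BECCYWILL", "ADELE_BROCK"]

# A set gives average constant-time membership tests instead of scanning the list.
bad_set = set(bad_names)

# Build the boolean Series once; no extra column is added to df.
is_bad = df["screen_name"].apply(lambda name: name in bad_set)

# Split into the two frames the question asks for.
bad_names_df = df[is_bad]
df = df[~is_bad]
```

Both frames come from the same boolean Series, so the membership test runs only once per row.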

You can also create a mask using isin:

mask = df["screen_name"].isin(bad_names)
print(df[mask])   # df of bad names
print(df[~mask])  # df of good names
Henry Yik
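
To get the two DataFrames the question asks for, the `isin` mask can be assigned rather than printed. Here is a small self-contained sketch under a few assumptions: the sample rows are invented, and because the names in `bad_names` are upper case while the sample `screen_name` values are lower case, the comparison is done on upper-cased names. Whether matching should ignore case is a guess about the data, so drop `.str.upper()` if an exact match is intended.

```python
import pandas as pd

# Invented sample rows; the real data comes from file.csv.
df = pd.DataFrame({
    "screen_name": ["narutouz16", "beccywill", "ADELE_BROCK", "narutouz16"],
    "bot": [0, 1, 1, 0],
})
bad_names = ["BECCYWILL", "ADELE_BROCK"]

# Assumption: matching ignores case, so compare upper-cased screen names.
mask = df["screen_name"].str.upper().isin(bad_names)

# Rows that match go to bad_names_df; the rest stay in df.
bad_names_df = df[mask].reset_index(drop=True)
df = df[~mask].reset_index(drop=True)

print(bad_names_df.shape, df.shape)
```

`reset_index(drop=True)` is optional; it just renumbers the rows in each result from zero.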