
I want to exclude some rows from my pandas DataFrame if they contain certain values.

excluded_url_subpath = ['/editer', '/administration', '/voir-les-transactions', '/modifier', '/diffuser', '/creation-paiement']

So I have a working solution that handles them one by one, like:

df = df[df['pagepath'].map(lambda x: False if '/editer' in x else True)]
df = df[df['pagepath'].map(lambda x: False if '/administration' in x else True)]
...

Or I could use the list I wrote above. But when I tried the following, the IDE told me that it cannot access the variable x:

df = df[df['pagepath'].map(lambda x: False for i in excluded_url_subpath if x in i)]

Where is the error here?

Ragnar
  • To the reviewers: yes, someone may have posted something similar, but that answer uses a costly regex. I prefer the solution here from @fabio-lipreri. – Ragnar Sep 11 '19 at 17:07

2 Answers


You can use a regex. First, I build an example DataFrame:

import pandas as pd
data = {'pagepath': ['/editer', 'to_keep', 'to_delete/editer/to_delete', 'hello/voir-les-transactions', 'to_keep'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
print(df)

The previous code builds the following DataFrame:

                               pagepath  year  reports
Cochice                         /editer  2012        4
Pima                            to_keep  2012       24
Santa Cruz   to_delete/editer/to_delete  2013       31
Maricopa    hello/voir-les-transactions  2014        2
Yuma                            to_keep  2014        3

Now, I adapted the solution from this answer to your case. First, in order to implement a general solution, I escape any non-alphanumeric characters that the strings in the excluded_url_subpath list may contain.

import re
excluded_url_subpath = ['/editer', '/administration', '/voir-les-transactions', '/modifier', '/diffuser', '/creation-paiement']
safe_excluded_url_subpath = [re.escape(m) for m in excluded_url_subpath]
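
For example, re.escape backslash-escapes regex metacharacters such as - so they are matched literally (output shown here for Python 3.7+, where / is left unescaped):

import re
print(re.escape('/creation-paiement'))
# /creation\-paiement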

Now, using the str.contains function, I construct a regex by joining your list with |:

df[~df.pagepath.str.contains('|'.join(safe_excluded_url_subpath))]
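
For reference, the joined pattern is simply the escaped substrings separated by |, which the regex engine treats as alternation, so str.contains matches a row if any one of the substrings occurs in pagepath. Printing it makes this concrete (output shown for Python 3.7+):

print('|'.join(safe_excluded_url_subpath))
# /editer|/administration|/voir\-les\-transactions|/modifier|/diffuser|/creation\-paiement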

I obtain the following DataFrame:

     pagepath  year  reports
Pima  to_keep  2012       24
Yuma  to_keep  2014        3
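
As an aside, the direct answer to "where is the error": Python parses the argument you passed to map as a generator expression whose element is lambda x: False, so the x in the trailing if x in i clause is never bound, which is why the IDE complains about x. The lambda must return one boolean per row; wrapping the membership tests in any() gives a non-regex alternative (a minimal sketch using the names from your question):

# Keep rows whose pagepath contains none of the excluded substrings
df = df[df['pagepath'].map(lambda x: not any(sub in x for sub in excluded_url_subpath))]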
FabioL

You can do it by filtering the DataFrame in a loop:

for excluded in excluded_url_subpath:
    # regex=False makes contains a plain substring test, so no escaping is needed
    df = df[~df['pagepath'].str.contains(excluded, regex=False)]
Manuel
  • I see what you are doing, but here the algorithm will scan the whole DataFrame on each loop iteration. It works the same way as trying each value with a hard-coded string, just a bit more automatable. – Ragnar Sep 12 '19 at 08:03