
I am using this line of code

df_mask = ~df[new_col_titles[:1]].apply(lambda x : x.str.contains('|'.join(filter_list), flags=re.IGNORECASE)).any(1)

to create a mask for my df. The filter list is

filter_list = ["[1]", "[2]", "[3]", "[4]", "[5]", "[6]", "[7]", "[8]","[9]",..."[n]"]

But I am getting weird results. I was hoping it would just filter out the rows in column 0 of the df that contain [1]...[n]. It doesn't: it also filters out rows that don't contain those elements. There is something of a pattern to it, though. It filters out rows that have numbers with "characters", by which I mean £55, 2010), 55*, 55 *

Can anyone explain what is going on, and is there a workaround for this?

Barmar
JPWilson
  • it's tough to visualize what's going on. Can you provide sample input and expected output? https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – David Erickson Sep 01 '20 at 22:48
  • `[]` has special meaning in regular expressions. You need to escape it if you want to match it literally. – Barmar Sep 01 '20 at 22:53
  • `[1]` matches the digit `1`, it doesn't match the square brackets. – Barmar Sep 01 '20 at 22:54

1 Answer


If you want to match the items in filter_list literally, use re.escape() to escape the special characters. As a regular expression, [1] is a character class that matches just the digit 1, not the string [1]. That is why any row containing one of those digits gets filtered out.

df_mask = ~df[new_col_titles[:1]].apply(lambda x : x.str.contains('|'.join(re.escape(f) for f in filter_list), flags=re.IGNORECASE)).any(axis=1)
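To see the difference without pandas, here is a minimal sketch using only the re module. The sample strings are made up for illustration (borrowed from the kinds of values mentioned in the question), and the filter list is assumed to run from [1] to [9].

```python
import re

# Assumed filter list: "[1]" through "[9]".
filter_list = [f"[{i}]" for i in range(1, 10)]

# Hypothetical sample values, similar to those described in the question.
samples = ["£55", "2010)", "see [2]", "no digits"]

# Unescaped: each "[n]" is a character class that matches the bare digit n.
raw_pattern = "|".join(filter_list)

# Escaped: each item is matched literally, square brackets included.
escaped_pattern = "|".join(re.escape(f) for f in filter_list)

raw_hits = [s for s in samples if re.search(raw_pattern, s)]
escaped_hits = [s for s in samples if re.search(escaped_pattern, s)]

print(raw_hits)      # every sample containing any digit 1-9
print(escaped_hits)  # only the sample containing a literal "[2]"
```

The unescaped pattern hits every string with a digit in it, which matches the behaviour described in the question.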

See Reference - What does this regex mean?
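If only a single column needs checking, the apply/any step can probably be dropped by calling str.contains on that column directly. This is a rough sketch under assumptions: the column name "text" and the data are hypothetical, not taken from the question, and the filter list is assumed to run from [1] to [9].

```python
import re

import pandas as pd

# Hypothetical frame; "text" stands in for the first column of the real df.
df = pd.DataFrame({"text": ["£55", "2010)", "ref [2]", "plain"]})

# Assumed filter list: "[1]" through "[9]".
filter_list = [f"[{i}]" for i in range(1, 10)]

# Escape each item so the brackets are matched literally.
pattern = "|".join(re.escape(f) for f in filter_list)

# na=False treats missing values as "no match", so those rows are kept.
keep = ~df["text"].str.contains(pattern, regex=True, na=False)
filtered = df[keep]

print(filtered)
```

Only the row containing the literal "[2]" is dropped; rows like £55 and 2010) stay.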

Barmar