
I have the following Python script that checks how many rows in column A of my dataset contain 3 or more consecutive signs like "?" or "!":

doublesigns = dataset[dataset["columnA"].str.contains("\?{3}|\!{3}", na=False)]

Now I want to do the same for 3 or more identical letters in a row, so that I can catch spelling errors like "helllo" instead of "hello". In the script below I have tried this for the first 4 letters of the alphabet:

doublesigns = dataset[dataset["columnA"].str.contains("\a?{3}|\b{3}|\c{3}|\d{3}", na=False)]

I get the following error:

error: bad escape \c at position 10

It looks like the error is occurring with certain letters, but not all of the letters. Does someone know what the right script is?

khelwood
marita
  • What do you think `\a` or `\c` do in regex? – matszwecja Mar 08 '23 at 09:27
  • pandas `str.contains` uses regex to check for patterns. In Regex you need to escape certain characters (e.g. `.\^[` and a few others). You do not need to escape letters. So the pattern should be `"a{3}|b{3}|c{3}|d{3}"`. But there are easier methods than to check every single letter. See for example [this answer](https://stackoverflow.com/a/1660758/14906662). – Jeanot Zubler Mar 08 '23 at 09:35

1 Answer

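  1. The backslash before ? is needed because ? has a special meaning in regular expressions. ! has no special meaning on its own, so \! works but the backslash is unnecessary. Letters must not be escaped: \a, \b and \d already mean something else (bell character, word boundary, digit), and \c is not a valid escape at all, which is why Python raises "bad escape \c".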

  2. If you have a|b|c|...|z you can express that as [a-z].

  3. You can capture something by surrounding it with parentheses, e.g. ([a-z]).

  4. You can use a backreference \n to match exactly the same text that capture group n matched, e.g. \1 for the first group.
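
The four points above can be checked quickly with Python's built-in re module. This is only a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# 1. "?" is a quantifier, so it must be escaped to match a literal question mark.
assert re.search(r"\?{3}", "what???") is not None

# Letters need no backslash; an unknown letter escape such as \c is rejected.
try:
    re.compile(r"\c{3}")
    bad_escape_message = ""
except re.error as exc:
    bad_escape_message = str(exc)

# 2. A character class replaces a long alternation of single letters.
assert re.fullmatch(r"[a-z]", "q") is not None
assert re.fullmatch(r"a|b|c", "q") is None

# 3. Parentheses capture whatever the enclosed pattern matched.
first_letter = re.search(r"([a-z])", "hello").group(1)

# 4. \1 only matches a repeat of exactly what group 1 captured.
doubled = re.search(r"([a-z])\1", "hello").group(0)
```

Here first_letter is "h" and doubled is "ll", because "ll" is the first place in "hello" where a letter is followed by a copy of itself.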

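So a regular expression to match the same lowercase letter three times in a row would be ([a-z])\1\1 or ([a-z])\1{2}, meaning a letter followed by two copies of itself. It also matches inside longer runs such as "llll". In Python, write it as a raw string, r"([a-z])\1{2}", so that \1 is not interpreted as a string escape.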
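
As a rough illustration of how this pattern behaves, here is a small sketch using only the standard library. The helper name has_triple_letter and the sample strings are assumptions for the example, not part of the question's dataset.

```python
import re

# The same lowercase letter three or more times in a row.
TRIPLE_LETTER = re.compile(r"([a-z])\1{2}")


def has_triple_letter(text):
    """Return True if text contains a lowercase letter repeated 3+ times in a row."""
    return TRIPLE_LETTER.search(text) is not None


# Hypothetical sample values standing in for the contents of columnA.
samples = ["hello", "helllo", "aab", "Nooo!!!", "HELLLO"]
flagged = [s for s in samples if has_triple_letter(s)]
```

With these samples, flagged contains "helllo" and "Nooo!!!". "HELLLO" is skipped because [a-z] only covers lowercase letters; passing flags=re.IGNORECASE should catch it too. In pandas the same raw string can be given to str.contains, as in dataset[dataset["columnA"].str.contains(r"([a-z])\1{2}", na=False)]. Because the pattern contains a capture group, pandas may emit a warning about match groups, but the filter should still work.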


Bob