1

I am looking for a regex pattern that matches only the words in a sentence that have no repeated consecutive characters.

I have tried r'(?!.*(\w)\1{3,}).+' as the regex pattern but it doesn't work.

For instance, in the sentence 'mike is amaaazing', I want the regex pattern to pick up 'mike' and 'is' only.

Any ideas?

Sina

2 Answers

3

You have to use a word boundary at the beginning and replace the dot with \w, so that your lookahead doesn't look past the word being tested.

>>> import re
>>> s = 'mike is amaaazing'
>>> [m[1] for m in re.findall(r'\b(?!\w*?(\w)\1)(\w+)', s)]
['mike', 'is']

Since re.findall returns only the capture groups when any are defined in the pattern, you can use a list comprehension to extract the second capture group (which holds the whole word).
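As a small sketch of an alternative to indexing into the tuples, `re.finditer` yields match objects, so the whole match can be read with `group(0)` regardless of how many capture groups the pattern defines. The names `PATTERN` and `words` are only illustrative.

```python
import re

# Same pattern as above: group 1 sits inside the negative lookahead,
# group 2 holds the whole word.
PATTERN = re.compile(r'\b(?!\w*?(\w)\1)(\w+)')

s = 'mike is amaaazing'

# finditer yields match objects; group(0) is always the full match,
# so no tuple indexing is needed.
words = [m.group(0) for m in PATTERN.finditer(s)]
print(words)  # ['mike', 'is']
```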

Casimir et Hippolyte
  • What should I use as token_pattern in CountVectorizer if I want to filter out words with repeated consecutive characters? – Sina Oct 19 '19 at 19:46
  • @Sina: I really don't know, but perhaps you can grab all the words and filter them in a second step with something like re.match. – Casimir et Hippolyte Oct 19 '19 at 19:53
  • I might be wrong, but there is a quantifier `{3,}` in the OP's pattern. Perhaps, to match for example `meeting` but not `amaaazing`, you could make it `\1{2,}`. +1 though. – The fourth bird Oct 19 '19 at 20:15
  • @Thefourthbird: I don't know: the question title says one thing and the question body something else. But that isn't actually Sina's problem here. – Casimir et Hippolyte Oct 19 '19 at 20:20
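To illustrate the quantifier point raised in the comments, here is a hedged sketch in which the allowed run length is a parameter. The helper name `words_without_runs` and its `max_run` argument are hypothetical, not part of any answer. With `max_run=2` the lookahead uses `\1{2,}`, so `meeting` is kept and `amaaazing` is rejected. Note that `(\w)\1{3,}` as written in the question would only reject runs of four or more identical characters.

```python
import re

def words_without_runs(text, max_run=1):
    """Return words in which no character occurs more than max_run times in a row.

    Hypothetical helper: max_run=1 forbids any doubled character,
    max_run=2 tolerates doubles such as 'ee' but forbids triples.
    """
    # (\w)\1{n,} matches a character followed by n or more copies of itself,
    # i.e. a run of at least n + 1 identical characters.
    pattern = re.compile(r'\b(?!\w*?(\w)\1{%d,})\w+' % max_run)
    return [m.group(0) for m in pattern.finditer(text)]

print(words_without_runs('a meeting is amaaazing', max_run=1))  # ['a', 'is']
print(words_without_runs('a meeting is amaaazing', max_run=2))  # ['a', 'meeting', 'is']
```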
2

You can try something like this:

\b(?:(\w)(?!\1))+\b
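A minimal sketch of how this pattern behaves in Python, assuming the sample sentence from the question. Because the pattern contains one capture group that is repeated, `re.findall` returns only the last character captured in each match, while `re.finditer` gives access to the whole match. The variable names are illustrative.

```python
import re

pattern = re.compile(r'\b(?:(\w)(?!\1))+\b')
s = 'mike is amaaazing'

# findall returns the contents of the single capture group, which after
# the repetition holds only the last character of each matched word.
last_chars = pattern.findall(s)
print(last_chars)  # ['e', 's']

# finditer exposes the full match through group(0).
whole_words = [m.group(0) for m in pattern.finditer(s)]
print(whole_words)  # ['mike', 'is']
```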


Code Maniac
  • @Sina it matches only what you described in the question. Can you elaborate on what is not working? Which rule is not followed? – Code Maniac Oct 19 '19 at 19:34
  • When I try it with `string='mike is amaaaazing'` and `pattern=r'\b(?:(\w)(?!\1))+\b'`, then `re.findall(pattern,string)` returns `['e', 's']` – Sina Oct 19 '19 at 19:35
  • @Sina: `re.findall` returns only the captures if any are defined; otherwise it returns the whole match. – Casimir et Hippolyte Oct 19 '19 at 19:36
  • I want to use it with re.findall, so what should the pattern be? – Sina Oct 19 '19 at 19:36
  • @Sina you can use `finditer`, see [`How can I find all the matches`](https://stackoverflow.com/questions/4697882/how-can-i-find-all-matches-to-a-regular-expression-in-python) – Code Maniac Oct 19 '19 at 19:41
  • I also want to use this pattern in CountVectorizer as token_pattern, and the pattern you provided doesn't do the job – Sina Oct 19 '19 at 19:43
  • For instance, I don't want CountVectorizer to pick up 'aaaaa' in a sentence – Sina Oct 19 '19 at 19:44
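One possible way to follow the "grab all words, then filter in a second step" suggestion from the comments is a plain tokenizer function. This is only a sketch under assumptions: the names `TOKEN`, `REPEAT`, and `tokenize` are hypothetical, and the two-or-more-character token rule is an assumption about the desired tokens. Such a callable could presumably be passed through CountVectorizer's `tokenizer` parameter instead of `token_pattern`, but that usage is not shown in the thread and is left as a comment here.

```python
import re

# Assumed token rule: runs of two or more word characters.
TOKEN = re.compile(r'\b\w\w+\b')
# Any character immediately followed by itself.
REPEAT = re.compile(r'(\w)\1')

def tokenize(text):
    """Hypothetical tokenizer: keep only tokens with no doubled character."""
    # Step 1: grab all candidate words. Step 2: drop those containing a repeat.
    return [t for t in TOKEN.findall(text) if not REPEAT.search(t)]

print(tokenize('mike is amaaazing aaaaa'))  # ['mike', 'is']

# Hypothetical usage, assuming scikit-learn is installed:
#   from sklearn.feature_extraction.text import CountVectorizer
#   vectorizer = CountVectorizer(tokenizer=tokenize)
```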