2

How can I find words with three or more vowels of the same kind with a regular expression using back referencing?

I'm searching a text file in a three-column, tab-separated format: "Word+PoS+Lemma".

This is what I have so far:

ggrep -P -i --colour=always '^\w*([aeioueöäüèéà])\w*?\1\w*?\1\w*?\t' filename

However, this gave me words with three vowels that were not all of the same kind. I'm confused, because I thought the backreference would refer to the same vowel that the bracketed group matched. Update: I solved this problem by changing the .*? in my earlier pattern to \w*?.
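One possible explanation for the earlier behaviour, sketched with Python's `re` module as a stand-in for `grep -P`: `.` also matches the tab, so the backreference can be satisfied by a vowel in the PoS column, while `\w` cannot leave the first column. The exact form of the earlier `.*?` pattern and the sample rows are assumptions made for illustration, and the duplicate `e` in the character class is dropped.

```python
import re

VOWELS = "aeiouöäüèéà"

# Assumed earlier pattern: `.*?` between the backreferences (may cross tabs).
dot_pattern = re.compile(rf"^\w*([{VOWELS}]).*?\1.*?\1.*?\t", re.IGNORECASE)
# Pattern from the question: `\w*?` keeps the whole match inside the first column.
word_pattern = re.compile(rf"^\w*([{VOWELS}])\w*?\1\w*?\1\w*?\t", re.IGNORECASE)

# Hypothetical rows in Word<TAB>PoS<TAB>Lemma format.
rows = [
    "banal\tADJD\tbanal",         # two a's in the word, a third one in the tag
    "beseelen\tVVINF\tbeseelen",  # three e's inside the word itself
]

for row in rows:
    word = row.split("\t")[0]
    print(word, bool(dot_pattern.search(row)), bool(word_pattern.search(row)))
# banal True False
# beseelen True True
```

With the `.*?` variant, "banal" is reported even though the word itself has only two a's, because the third one comes from the tag "ADJD".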

Thanks for the help!

sgelena

3 Answers

2

Your regex looks more complicated than it needs to be. I'm not sure what you're trying to accomplish with the .*?, but the usage looks suspect. I'd use something like:

([aeioueöäüèéà])\1\1

That is, match a vowel in a capture group, then require the same vowel twice more directly after it.

Edit: I didn't realise you wanted to allow other letters between the vowels. In that case, allow zero or more "word" characters before each backreference:

([aeioueöäüèéà])(\w*\1){2}
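
As a rough check of how the two patterns above differ, here is a minimal sketch using Python's `re` module in place of grep. The test words are hypothetical, and the duplicate `e` in the character class is dropped.

```python
import re

VOWELS = "aeiouöäüèéà"

# Three identical vowels directly in a row.
consecutive = re.compile(rf"([{VOWELS}])\1\1", re.IGNORECASE)
# The same vowel three times, with any word characters in between.
spread = re.compile(rf"([{VOWELS}])(\w*\1){{2}}", re.IGNORECASE)

# Hypothetical test words.
words = ["beseelen", "Seeelefant", "Banane", "Haus"]

for word in words:
    print(word, bool(consecutive.search(word)), bool(spread.search(word)))
# beseelen False True
# Seeelefant True True
# Banane False False
# Haus False False
```

"beseelen" is only found by the second pattern, since its three e's are not all adjacent.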
Sam Mason
  • demo: https://regexr.com/6vgfr – Sam Mason Oct 05 '22 at 21:23
  • Hi but doesn't that match three vowels in a row? I'm looking for words like 'beseelen' too, which e.g. has three e's. I just changed the `.*?` to `\w*?`. – sgelena Oct 05 '22 at 21:24
  • Why do you keep writing `*?`? I'd suggest looking at a regex reference page. `*` already allows zero matches, so the trailing `?` adds no optionality; in PCRE it only makes the quantifier lazy. – Sam Mason Oct 05 '22 at 22:13
1

With GNU grep, I suggest:

grep -E --colour=always -i '\b\w*([aeioueöäüèéà])(\w*\1){2,}\w*'

See: The Stack Overflow Regular Expressions FAQ

Cyrus
-1

Using grep

$ grep -E '(([aeioueöäüèéà])[^\2]*){3,}' input_file
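
A caveat worth checking before relying on this pattern: inside a bracket expression, `\2` is not a backreference. In POSIX bracket expressions the backslash is literal, so `[^\2]` means "any character except `\` and `2`"; Python's `re` reads it as the character U+0002 instead, but in neither case does it refer to the captured vowel. A minimal sketch in Python, with the vowel class shortened and the test words chosen for illustration:

```python
import re

# Pattern from above, shortened to plain vowels: `\2` sits inside the brackets.
bracket_attempt = re.compile(r"(([aeiou])[^\2]*){3,}")
# Backreference placed outside the brackets instead.
backref = re.compile(r"([aeiou])(\w*\1){2,}")

# Hypothetical test words: "Banane" has the vowels a, a, e.
for word in ["Banane", "beseelen"]:
    print(word, bool(bracket_attempt.search(word)), bool(backref.search(word)))
# Banane True False
# beseelen True True
```

The bracketed version accepts any three vowels, which is why "Banane" slips through.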
HatLess