
I have a regex that I need to filter messages containing the words

gratis, grátis, grétis, grâtis, grôtis......

So I thought this should be as easy as /gr.tis/, but it does not work. I am using this regex on CentOS to filter emails with Postfix.

The problem is that if the message contains "gratis" it gets filtered, but if it contains "grátis" or "grétis"... it does not. What is going on?

EDIT: For some reason, replacing . with .{1,5} worked. Why?

Samul

3 Answers

0

Try it like this: /gr.*tis/. It seems like an encoding problem caused by the special characters, i.e., "á", "ô", ...

guilhermerama
  • WORKED! But the problem is that this message will also be filtered out: "eu GRovei arTISta muito bem" – Samul Nov 06 '15 at 16:17
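
The false positive in the comment above can be reproduced with a short Python sketch. Python's re module stands in for the engine behind Postfix here, and re.IGNORECASE is an assumption made so that the upper-case letters in the sample can match; the variable names are illustrative only.

```python
import re

message = "eu GRovei arTISta muito bem"

# ".*" is greedy and may span spaces, so "gr" and "tis" can sit in different words.
loose = re.search(r"gr.*tis", message, re.IGNORECASE)

# Exactly one character between "gr" and "tis" cannot bridge the two words.
strict = re.search(r"gr.tis", message, re.IGNORECASE)

print(loose.group(0) if loose else None)
print(strict)
```

So the looser the gap between "gr" and "tis", the more unrelated text it can swallow.
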
0

I would choose something a little more robust...

(?<=\b)(g|G)r(.)tis(?=\b)
  • This will find the word at the start or in the middle of a string,
  • search for capital G or lower case g
  • stop at a space, end of line, or non-word character such as "," or "."

If you use

gr.[^ -~]{0,4}tis

Then you will match the "gratis" in a word like lksdfkjhasgratisaljsdhfkjsdf, because gratis sits in the middle of it and the regex cannot tell that gratis is only part of a word rather than the whole word. So you will end up with false positives and an inflated match count.

Not only that, but you will never match:

Gratis Grátis Grétis Grâtis or Grôtis
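
To make the comparison concrete, here is a rough Python sketch run on decoded (Unicode) text. It uses \b[gG]r.tis\b, which should behave like the lookaround version above because \b is already zero-width; that simplification, the sample words, and the variable names are assumptions for illustration, not part of the original patterns.

```python
import re

# Bounded pattern: whole word only, capital or lower-case g.
bounded = re.compile(r"\b[gG]r.tis\b")
# Unbounded pattern from the text above: no word boundaries, lower-case g only.
unbounded = re.compile(r"gr.[^ -~]{0,4}tis")

samples = [
    "Gratis",
    "gr\u00e1tis",   # grátis
    "Gr\u00f4tis",   # Grôtis
    "lksdfkjhasgratisaljsdhfkjsdf",
    "Gratisography",
]

# Map each sample to (bounded matched?, unbounded matched?).
results = {
    word: (bool(bounded.search(word)), bool(unbounded.search(word)))
    for word in samples
}

for word, (hit_bounded, hit_unbounded) in results.items():
    print(f"{word}: bounded={hit_bounded} unbounded={hit_unbounded}")
```
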

Edited my answer to improve the number of steps taken

Rhubbarb
Nefariis
  • Really nice, it's a really good answer, but why can't I just use gr([^ ]+)tis ? – Samul Nov 06 '15 at 19:56
  • Because you will not match Gratis or any other capital-G form of the word. You will also match every word that contains gratis, like Gratisography (which is actually a word), because "gr([^ ]+)tis" is not bounded... So the reason not to do it is that you will both miss words and match words you do not want. – Nefariis Nov 06 '15 at 20:07
  • To show you an example of this - This is my code https://regex101.com/r/bK0hJ0/1 ... and this would be the other code https://regex101.com/r/eR6uH4/1 ... Notice the other code is wrong quite a bit – Nefariis Nov 06 '15 at 20:18
  • Good, I hope it was helpful - also make sure to show some checkmark love. – Nefariis Nov 06 '15 at 21:19
0

As said in my comment:

The reason that replacing . with .{1,5} works is that whatever engine is reading the string is reading non-ASCII letters/symbols as something other than their actual character (i.e., it could be the Unicode character representation of the symbol, like \u00FF or something).

That is why guilhermerama's answer, /gr.*tis/, and replacing the . token with one that accepts multiple characters would both work.

R Nar
  • You might have UTF-8 encoded data. The UTF-8 encoding of Unicode is designed to work well (but not perfectly) when processed using (8-bit) ASCII routines. (e.g. with UTF-8, like ASCII, no zero bytes appear in a stream representing text.) An accented character such as 'á' (Unicode codepoint decimal 225, or hex 0xE1) is UTF-8 encoded with bytes 0xC3 0xA1, and a routine expecting ASCII will probably interpret this as the two characters 'Ã' and '¡'. (Really ASCII is a 7-bit code, so by 8-bit ASCII here, I really mean some 8-bit extension of ASCII such as the ISO 8859-1 / Latin-1 character set.) – Rhubbarb Jul 26 '16 at 13:11
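
The byte-level effect described in this comment can be seen in a small Python sketch. It assumes the filter sees the raw UTF-8 bytes of the message and matches byte by byte; whether the regex engine behind Postfix actually does this depends on how it is built and configured, so treat that as an assumption. The variable names are illustrative only.

```python
import re

text = "gr\u00e1tis"          # "grátis" as decoded text: 6 characters
raw = text.encode("utf-8")    # b'gr\xc3\xa1tis': the accented letter becomes 2 bytes

print(len(text), len(raw))
# Reading the UTF-8 bytes as Latin-1 shows the two-character misinterpretation.
misread = raw.decode("latin-1")
print(misread)

# On decoded text, "." covers the whole accented letter.
one_char = re.search(r"gr.tis", text)
# On raw bytes, "." covers only the first of the two bytes, so this fails.
one_byte = re.search(rb"gr.tis", raw)
# Allowing 1 to 5 bytes between "gr" and "tis" covers both bytes.
few_bytes = re.search(rb"gr.{1,5}tis", raw)

print(bool(one_char), bool(one_byte), bool(few_bytes))
```

If the engine can be told to treat the input as UTF-8, the single "." form may work as originally intended; otherwise a small bounded quantifier such as .{1,2} is a tighter workaround than .* or .{1,5}.
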