Regex and special characters

Question

I need a regex that filters messages containing any of the words

gratis, grátis, grétis, grâtis, grôtis......

So I think this should be as easy as /gr.tis/, but it does not work. I am using this regex on CentOS to filter emails with Postfix.

The problem is that if the message contains "gratis" it gets filtered, but if it contains "grátis" or "grétis"... it does not. What is going on?

EDIT: for some reason, .{1,5} worked. Why?

Have a look at this answer to a similar question: http://stackoverflow.com/a/26900132/201706 — Mike P, Nov 06 '15 at 16:08
@shawnt00 didn't work. Mike P, I will try your suggestion right now. — Samul, Nov 06 '15 at 16:10
What about `(*UCP)gr\Xtis` ? Is that [pcre regex](https://www.regex101.com/r/mY2kF3/1) and input unicode? — bobble bubble, Nov 06 '15 at 16:13
@bobble bubble UCP does not work. I am using the regex in Postfix, so maybe there is a limitation — Samul, Nov 06 '15 at 16:23
`\X` is the "unicode dot" in pcre. It matches one extended grapheme cluster, i.e. a base character together with any combining marks. — bobble bubble, Nov 06 '15 at 16:25
To answer your edit: it is probably an encoding issue. Your engine might be reading each non-ASCII letter as some longer representation of the character, something like `\u00FA` (probably not that exact form, considering it is longer than 5 characters) — R Nar, Nov 06 '15 at 16:28
@R Nar you are right! Can you post this as an answer so I can accept it? — Samul, Nov 06 '15 at 16:37
What would happen if you tried something like `/gr.[^ -~]{0,4}tis/`? — bobble bubble, Nov 06 '15 at 16:45

score 0 · Answer 1 · answered Nov 06 '15 at 16:13

Try this: /gr.*tis/. It seems like an encoding problem caused by the special characters, i.e., "á", "ô", ...

guilhermerama

WORKED! But the problem is that this message will also be filtered out: "eu GRovei arTISta muito bem" – Samul Nov 06 '15 at 16:17
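
The over-matching described in this comment is easy to reproduce. Below is a minimal sketch in Python rather than in Postfix itself. It assumes case-insensitive matching (the default for Postfix regexp tables), and the first sample string is made up for illustration:

```python
import re

# Assumption: case-insensitive matching, mirroring the Postfix regexp default.
greedy = re.compile(r"gr.*tis", re.IGNORECASE)

samples = [
    "oferta gratis hoje",           # hypothetical spam line, should be filtered
    "eu GRovei arTISta muito bem",  # line from the comment, should NOT be filtered
]

for text in samples:
    # search() reports a hit anywhere in the line, like a filter rule would
    print(f"{text!r}: {bool(greedy.search(text))}")
```

Both lines are reported as matches, because `.*` may span any number of characters, spaces included, between `gr` and `tis`.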

score 0 · Answer 2 · edited Jul 26 '16 at 12:54

I would choose something a little more robust...

(?<=\b)(g|G)r(.)tis(?=\b)

This will find the word at the start or in the middle of a string,
match either a capital G or a lowercase g,
and stop before a space, end of line, or non-word character such as "," or ".".

If you use

gr.[^ -~]{0,4}tis

Then you will match the "gratis" inside a string like lksdfkjhasgratisaljsdhfkjsdf, because the pattern has no word boundaries and cannot tell whether gratis is the whole word or only part of a longer one. So you will end up with false positives and an inflated match count.

Not only that, but if the engine matches case-sensitively, you will never match the capitalized forms:

Gratis Grátis Grétis Grâtis or Grôtis
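
The difference between the two patterns can be checked with a short sketch. It is written in Python, not Postfix, and it runs on already-decoded Unicode strings, where `.` matches one accented character. That is an assumption that does not hold for the byte-oriented setup in the question. The bounded pattern is a simplified form of the one above (`\b` directly instead of lookarounds):

```python
import re

bounded = re.compile(r"\b[gG]r.tis\b")        # word-bounded, explicit G or g
unbounded = re.compile(r"gr.[^ -~]{0,4}tis")  # pattern suggested in the comments

samples = ["gratis", "Grátis", "lksdfkjhasgratisaljsdhfkjsdf"]

for text in samples:
    b = bool(bounded.search(text))
    u = bool(unbounded.search(text))
    print(f"{text:30} bounded={b} unbounded={u}")
```

The bounded pattern accepts the first two samples and rejects the third. The unbounded one, matched case-sensitively, accepts the third and misses "Grátis".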

Edited my answer to reduce the number of steps taken

edited Jul 26 '16 at 12:54

Rhubbarb

answered Nov 06 '15 at 18:02

Nefariis

Really nice, it's a really good answer, but why can't I just use gr([^ ]+)tis ? – Samul Nov 06 '15 at 19:56
Because you will not match Gratis or any other capital-G form of the word. You will also match every word that contains gratis, like Gratisography (which is actually a word), because "gr([^ ]+)tis" is not bounded... So the reason not to do it is that you will both miss words and match words you do not want – Nefariis Nov 06 '15 at 20:07
To show you an example of this - This is my code https://regex101.com/r/bK0hJ0/1 ... and this would be the other code https://regex101.com/r/eR6uH4/1 ... Notice the other code is wrong quite a bit – Nefariis Nov 06 '15 at 20:18
Good, I hope it was helpful - also make sure to show some checkmark love. – Nefariis Nov 06 '15 at 21:19

score 0 · Accepted Answer · answered Nov 06 '15 at 18:07

As said in my comment:

The reason that replacing . with .{1,5} works is that whatever engine is reading the string is reading non-ASCII letters/symbols as something other than a single character (i.e., it could be a longer representation of the symbol, like \u00FF or something similar).

That is why guilhermerama's answer, /gr.*tis/, works, and why changing the . token to accept multiple instances works as well.
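
A minimal sketch of this effect, written in Python rather than Postfix. It assumes the engine sees the raw UTF-8 bytes of the message, which is an assumption about the setup and not something confirmed in the thread:

```python
import re

# Assumption: the filter sees raw UTF-8 bytes, so 'á' arrives as two bytes.
raw = "grátis".encode("utf-8")

single = re.compile(rb"gr.tis")       # '.' consumes exactly one byte
ranged = re.compile(rb"gr.{1,5}tis")  # '.' may consume one to five bytes

print(raw)                                # b'gr\xc3\xa1tis'
print(bool(single.search(raw)))           # False: two bytes sit between 'gr' and 'tis'
print(bool(ranged.search(raw)))           # True
print(bool(single.search(b"gratis")))     # True: plain ASCII 'a' is one byte
```

With byte patterns, `.` stands for one byte, so the plain-ASCII word matches while the accented one needs a quantifier that allows at least two.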

R Nar

You might have UTF-8 encoded data. The UTF-8 encoding of Unicode is designed to work well (but not perfectly) when processed using (8-bit) ASCII routines. (e.g. with UTF-8, like ASCII, no zero bytes appear in a stream representing text.) An accented character such as 'á' (Unicode codepoint decimal 225, or hex 0xE1) is UTF-8 encoded with bytes 0xC3 0xA1, and a routine expecting ASCII will probably interpret this as the two characters 'Ã' and '¡'. (Really ASCII is a 7-bit code, so by 8-bit ASCII here, I really mean some 8-bit extension of ASCII such as the ISO 8859-1 / Latin-1 character set.) – Rhubbarb Jul 26 '16 at 13:11
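
The byte values mentioned in this comment can be verified with a few lines of Python. This is only an illustration of the encoding, not of how Postfix itself handles the data:

```python
text = "á"
utf8 = text.encode("utf-8")

print(hex(ord(text)))          # 0xe1, the Unicode code point (decimal 225)
print(utf8)                    # b'\xc3\xa1', the two UTF-8 bytes
print(utf8.decode("latin-1"))  # Ã¡, the same bytes read as Latin-1
```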
