In my Java app, I want to use a regex to check whether a string occurs in a text.

The case I want to cover is this one: let's assume that my original text is the following French text (with an accent):

démo test

I want to know if the word demo (without accent) exists in the text, using a regex. The thing is: I can't change the original text (I can't use Normalizer.normalize() for example), since I'm using a library that takes a regex as an argument.

Here is what I tried:

  • If I use "(?i)démo", there is a match (since démo exists)
  • If I use "(?i)demo", there is no match, but I also want a match here. I want the regex to be accent insensitive.

So far, I haven't managed to find a regex that can cover that specific case.

Is there any regex that can cover that case?

Thanks for your help.

matteoh
  • That text doesn't contain `demo` without an accent. Does your matcher say that it does? Please include the code, various input cases and expected output in your question (not the comments). – RealSkeptic Mar 27 '19 at 17:31
  • Should the regex return true when the text is `démo`, or only if it is `demo`? – user85421 Mar 27 '19 at 17:32
  • @CarlosHeuberger: the regex should tell me that "démo test" and "demo" match. – matteoh Mar 27 '19 at 17:34
  • Are you processing only french language? Or you may have unknown number of letters with accents? – Pavel Smirnov Mar 27 '19 at 17:44
  • You have to Normalize the text first. Does Java do that ? –  Mar 27 '19 at 17:45
  • do you mean `boolean check = "démo test".matches("[a-zA-ZÀ-ÖØ-öø-ÿ\\s]+");` – Youcef LAIDANI Mar 27 '19 at 17:45
  • @PavelSmirnov Yes, only french language – matteoh Mar 27 '19 at 17:46
  • In this case you have limited number of letters. And you can append them to corresponding normalized letters in your regexp, i.e. "d(e|é)mo" – Pavel Smirnov Mar 27 '19 at 17:50
  • Try this `(?i)d[e\xE9\xC9]mo` –  Mar 27 '19 at 17:54
  • If you _could_ change the input text, the best solution would be to normalize it and remove marks. See this answer: https://stackoverflow.com/q/35783135/3688648 – Felk Mar 27 '19 at 18:02
  • Or, you could normalize just the accented characters within your regex, see https://www2.rocketlanguages.com/french/lessons/french-accents/ –  Mar 27 '19 at 18:03
  • @sln Thanks, I already know how French accents work (I'm French), but I don't know how to normalize in the regex – matteoh Mar 27 '19 at 18:29

2 Answers

Assuming you really cannot change the input text, the following works:

If your input text is in decomposed form, meaning that démo consists of the Unicode code points d e COMBINING ACUTE ACCENT m o, you can optionally match the accent:

de\pM?mo

where \pM describes the Unicode property "Mark", so it matches any combining mark. If you only care about that exact accent, you can instead optionally match \u0301 (COMBINING ACUTE ACCENT) directly.
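
As a rough illustration of the decomposed case, here is a minimal sketch. The class name `DecomposedMatch` and the helper `containsDemo` are made up for this example, and `(?i)` is carried over from the question. Note that inside a Java string literal the backslash has to be doubled.

```java
import java.util.regex.Pattern;

public class DecomposedMatch {
    // Hypothetical helper for illustration: "e" followed by an optional
    // combining mark (Unicode category M).
    static final Pattern DEMO = Pattern.compile("(?i)de\\pM?mo");

    static boolean containsDemo(String text) {
        return DEMO.matcher(text).find();
    }

    public static void main(String[] args) {
        // Decomposed input: d, e, COMBINING ACUTE ACCENT (U+0301), m, o
        String decomposed = "de\u0301mo test";
        // Precomposed input: d, LATIN SMALL LETTER E WITH ACUTE (U+00E9), m, o
        String composed = "d\u00e9mo test";

        System.out.println(containsDemo(decomposed)); // true
        System.out.println(containsDemo("demo test")); // true
        System.out.println(containsDemo(composed));    // false: no plain "e" to match
    }
}
```

The last call shows why this pattern alone is not enough when the text is in composed form.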

If your text is in composed form, meaning démo consists of the Unicode code points d LATIN SMALL LETTER E WITH ACUTE m o, you'll have to match either letter explicitly in the regex:

d(e|é)mo
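
If you don't know which form the text is in, the two ideas can probably be combined into one alternation. The sketch below is only an assumption about how that might look: the class name `ComposedMatch` and the helper `containsDemo` are invented, and `(?iu)` (case-insensitive matching with Unicode case folding) is added so that an uppercase É is also accepted. The accented letter is written as a `\u00e9` escape so the source file encoding doesn't matter.

```java
import java.util.regex.Pattern;

public class ComposedMatch {
    // Hypothetical pattern for illustration:
    //   e\pM?   plain "e", optionally followed by a combining mark (decomposed form)
    //   \u00e9  precomposed "e with acute" (composed form)
    static final Pattern DEMO = Pattern.compile("(?iu)d(?:e\\pM?|\u00e9)mo");

    static boolean containsDemo(String text) {
        return DEMO.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsDemo("d\u00e9mo test"));  // composed form: true
        System.out.println(containsDemo("de\u0301mo test")); // decomposed form: true
        System.out.println(containsDemo("demo test"));       // no accent: true
        System.out.println(containsDemo("D\u00c9MO TEST"));  // uppercase: true
    }
}
```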
Felk

One way is to modify the regex literal itself: search it for each letter
that can carry an accent and replace that letter with a character class
listing all of its variants.

 Find any one of these      Replace with this literal
------------------------------------------------------

 [aâàä]         ->       [aâàä]
 [cç]           ->       [cç]
 [eéèêë]        ->       [eéèêë]
 [iîï]          ->       [iîï]
 [oô]           ->       [oô]
 [uùûü]         ->       [uùûü]
 [?œ]           ->       ????

This requires running seven separate global find/replace passes over the
search string, one per row of the table.
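
The find/replace passes could be sketched in Java roughly as follows. This is an illustration only: the class `AccentInsensitive` and its methods are hypothetical names, the search word is assumed to be a plain word with no regex metacharacters, and only the six single-letter groups are handled (the œ row is left out because its replacement is not specified above). Accented letters are written as `\u` escapes so the source compiles regardless of file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccentInsensitive {
    // Letter groups from the table, one per pass (assumption: French only).
    private static final String[] GROUPS = {
        "a\u00e2\u00e0\u00e4",       // a and its accented variants
        "c\u00e7",                   // c and c-cedilla
        "e\u00e9\u00e8\u00ea\u00eb", // e and its accented variants
        "i\u00ee\u00ef",             // i and its accented variants
        "o\u00f4",                   // o and o-circumflex
        "u\u00f9\u00fb\u00fc"        // u and its accented variants
    };

    // Rewrites a plain search word into a regex in which every letter of a
    // group is replaced by a character class containing the whole group.
    static String toAccentInsensitive(String word) {
        String result = word;
        for (String group : GROUPS) {
            String charClass = "[" + group + "]";
            // The groups are disjoint, so a later pass never touches the
            // classes inserted by an earlier one.
            result = result.replaceAll(charClass, Matcher.quoteReplacement(charClass));
        }
        return result;
    }

    static boolean contains(String text, String word) {
        // (?iu): case-insensitive matching with Unicode case folding.
        return Pattern.compile("(?iu)" + toAccentInsensitive(word)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(toAccentInsensitive("demo"));
        System.out.println(contains("d\u00e9mo test", "demo")); // true
    }
}
```

Because the rewritten pattern is just a string, it can be passed to any library that only accepts a regex. Text in decomposed form would still need the optional combining mark described in the other answer.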