I have an email field in a form that is currently validated with the GenericValidator.isEmail method. I now need an additional validation that rejects email addresses containing accented characters. I was thinking of using a regex pattern-matching approach, and I found one on Stack Overflow itself:

if (Pattern.matches(".*[éèàù].*", input)) {
  // your code
}

The problem is that this pattern only covers é, è, à and ù, but there are many other accented characters, such as õ, ü and ì. Is there a way to match all accented characters? I need to cover the accented characters used in NL (Dutch), FR (French) and DE (German). If the email address contains any accented character, I need to stop execution there and throw an error.

Wiktor Stribiżew
  • Maybe subtract the `A-Za-z` from `\p{L}`? `Pattern.matches("(?s).*[\\p{L}&&[^A-Za-z]].*", input)`? – Wiktor Stribiżew Mar 24 '21 at 09:43
  • I am new to regex pattern matching. could you kindly tell me the meaning of "?s" and "\\p{L}". Just unable to evaluate what this pattern is evaluating to from the left – amitsingh6651 Mar 24 '21 at 09:48
  • Does it work as intended for you? It matches `ł`, `ф`, `Й`, etc., any letters but the ASCII letters. – Wiktor Stribiżew Mar 24 '21 at 09:54
  • are you sure only accent characters are problem for you? E.g. German alphabet does not have accented letters, per se, they are ü, ö, ä which is not exactly accented. Can we assume you want only English characters from all alphabets out there? – Boris Strandjev Mar 24 '21 at 10:16
  • Yes I tried with èàùüõìÉǞ and it works infact all of the accented characters I tried but did not get what you meant by ASCII letters? U mean the normal a-z and A-Z for which there are ASCII Codes right? only for these it wont work? – amitsingh6651 Mar 24 '21 at 10:21
  • Yes, ASCII letters means the "English" letters from the English alphabet. – Wiktor Stribiżew Mar 24 '21 at 10:50
  • @WiktorStribiżew well thanks it works indeed !! :) – amitsingh6651 Mar 24 '21 at 11:15

1 Answer

It turns out you want to match any letter but an ASCII letter.

I suggest subtracting ASCII letters from the \p{L} pattern that matches any Unicode letter:

Pattern.matches("(?s).*[\\p{L}&&[^A-Za-z]].*", input)

Here,

  • (?s) - Pattern.DOTALL embedded flag option that makes . match across lines
  • .* - any zero or more chars, as many as possible
  • [\\p{L}&&[^A-Za-z]] - any Unicode letter except ASCII letters
  • .* - any zero or more chars, as many as possible.
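
As a minimal sketch of how the pieces above fit together, the snippet below wraps the pattern in a helper and runs it on three sample inputs. The class name, the method name and the sample addresses are hypothetical and only serve the illustration; the non-ASCII letters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

public class NonAsciiLetterDemo {
    // Character-class intersection: any Unicode letter (\p{L}) that is NOT in A-Za-z.
    static final String NON_ASCII_LETTER = "[\\p{L}&&[^A-Za-z]]";

    // Hypothetical helper: true if the input contains at least one non-ASCII letter.
    // Pattern.matches must consume the whole input, hence the surrounding .* parts;
    // (?s) lets . also match line break characters.
    static boolean containsNonAsciiLetter(String input) {
        return Pattern.matches("(?s).*" + NON_ASCII_LETTER + ".*", input);
    }

    public static void main(String[] args) {
        // Only ASCII letters, digits and punctuation
        System.out.println(containsNonAsciiLetter("john.doe@example.com"));       // false
        // \u00fc is "ü"
        System.out.println(containsNonAsciiLetter("j\u00fcrgen@example.com"));    // true
        // \u00e9 is "é", placed after a line break to exercise (?s)
        System.out.println(containsNonAsciiLetter("line1\nren\u00e9@example.com")); // true
    }
}
```

Digits, dots and the @ sign are not letters, so they never trigger the check; only letters outside A-Za-z do.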

Note that find() is a better fit here: it succeeds as soon as any part of the input matches, so the leading (?s).* and trailing .* are unnecessary. Dropping them makes the check much more efficient on longer strings:

Pattern.compile("[\\p{L}&&[^A-Za-z]]").matcher(input).find()

See this Java demo.
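
Since the question asks to stop execution and throw an error, here is one possible way to wire the find() variant into a validation step. It is only a sketch: the class name, the method name and the choice of IllegalArgumentException are assumptions, not part of the original form code.

```java
import java.util.regex.Pattern;

public class EmailAsciiLetterValidator {
    // Compiled once and reused; matches a single Unicode letter outside A-Za-z.
    private static final Pattern NON_ASCII_LETTER =
            Pattern.compile("[\\p{L}&&[^A-Za-z]]");

    // Hypothetical validation step: throws if the address contains any
    // letter other than an ASCII letter, otherwise returns normally.
    static void requireAsciiLettersOnly(String email) {
        if (NON_ASCII_LETTER.matcher(email).find()) {
            throw new IllegalArgumentException(
                    "Email address contains a non-ASCII letter: " + email);
        }
    }

    public static void main(String[] args) {
        // Passes silently
        requireAsciiLettersOnly("john.doe@example.com");

        try {
            // \u00e9 is "é", so this call throws
            requireAsciiLettersOnly("ren\u00e9@example.com");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

This check would run alongside the existing GenericValidator.isEmail validation, not replace it, because it only looks for non-ASCII letters and does not verify the overall address format.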

Wiktor Stribiżew