Validate special characters by negating Unicode letters with a regex pattern?

Question

The regex \p{L}+ matches the characters "ASKJKSDJKDSJÄÖÅüé" in the example string "ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé", which is great, but it is the exact opposite of what I want. That leads me to negating the regex.

Goal:

I want to match any and all characters that are neither a letter nor a number, in any language.

Could a negative regex be a natural direction for this?

I should mention that one intended use for the regex is to validate passwords against this rule:

the password needs to contain at least one special character, which I define as a character that is neither a number nor a letter.

It would seem defining ranges of special characters should be avoided if possible, because why limit the possibilities? Thus my definition. I assume there could be some problems with such a wide definition, but it is a first step.

If you have suggestions for a better solution than the one I'm giving below, or just have some thoughts on the subject, I'm sure I'm not the only one who would like to learn about it. Thanks.

Note I'm using double \\ in the Java code. Platform is Java 11.

Thanks for the suggestion @VGR. That's a POSIX character class (US-ASCII only) though, and I basically want everything that is not a letter, where letters include the umlauts you see in the example string. Otherwise a very useful suggestion. I'm going to test it for completeness later. – MiB, Feb 07 '22 at 20:03
It will be fully Unicode compliant if you use the [UNICODE_CHARACTER_CLASS](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS) flag. – VGR, Feb 07 '22 at 21:33
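
To illustrate the flag mentioned in the comment above, here is a minimal sketch. The exact class VGR suggested is not shown in this thread, so \p{Alnum} is an assumption, and the class and field names are made up for the example. Non-ASCII letters are written as \u escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

class UnicodeAlnumCheck {
    // Without the flag, \p{Alnum} only covers US-ASCII letters and digits.
    static final Pattern ASCII_ONLY = Pattern.compile("[^\\p{Alnum}]");

    // With UNICODE_CHARACTER_CLASS, \p{Alnum} follows the Unicode
    // alphabetic and digit properties instead.
    static final Pattern UNICODE_AWARE =
            Pattern.compile("[^\\p{Alnum}]", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        // "DSJ" followed by A-umlaut, O-umlaut, A-ring: letters only.
        String lettersOnly = "DSJ\u00C4\u00D6\u00C5";
        // ASCII-only class treats the umlauts as "special".
        System.out.println(ASCII_ONLY.matcher(lettersOnly).find());
        // Unicode-aware class treats them as letters.
        System.out.println(UNICODE_AWARE.matcher(lettersOnly).find());
    }
}
```

With the flag set, the umlauts are no longer reported as special characters, which is the behaviour the question asks for.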

MiB · Answer 1 · 2022-02-07T15:18:26.260

So after having read similar, though not identical, questions and some equally great answers, I came up with this solution: (?=\P{L})(?=\P{N}), meaning the next character must be both not a letter and not a number. Even if I assert numbers separately, I need to negate both to meet my definition of special characters (see the question).

This makes use of non-consuming (lookahead) groups, written with parentheses and ?=. The expression in the first group is checked first, and then the expression in the second group is checked from the same position. Thanks to @Jason Cohen for this detail in the "Regular Expressions: Is there an AND operator?" discussion.

The uppercase P in \P{L} and \P{N} expresses "not belonging to a category" in Unicode categories, i.e. the opposite of a lowercase p.

It's not perfect for a real world solution, but works as a starting point at least. Note I'm using double \\ in the Java code. Platform is Java 11.
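
A minimal sketch of how the pattern above could be used for the password rule. The class and method names are hypothetical, chosen only for illustration. Because lookaheads consume nothing, the match itself is an empty string; only its position tells you where the special character is.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadSpecialCheck {
    // Two lookaheads at the same position: the next character must be
    // neither a letter (\P{L}) nor a number (\P{N}).
    static final Pattern SPECIAL_AHEAD = Pattern.compile("(?=\\P{L})(?=\\P{N})");

    // Hypothetical helper: true if some position is followed by a special character.
    static boolean hasSpecial(String s) {
        return SPECIAL_AHEAD.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSpecial("\u00C4\u00D6\u00C5123"));   // umlauts and digits only
        System.out.println(hasSpecial("\u00C4\u00D6\u00C5123!"));  // same, plus '!'

        // The match is zero-width: start() is the index of the special
        // character, and group() is the empty string.
        Matcher m = SPECIAL_AHEAD.matcher("a!");
        if (m.find()) {
            System.out.println(m.start() + " " + m.group().isEmpty());
        }
    }
}
```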

score 1 · Accepted Answer · answered Feb 07 '22 at 14:54

1

You can shove those \\p things in [], and thus use the fact that you can negate character classes. This is all you need:

Pattern p = Pattern.compile("[^\\p{L}]");
Matcher m = p.matcher("ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé");
while (m.find()) System.out.print(m.group(0));

That prints:

_-.;,!”#€%&/()=?`¨’<>

Which is exactly what you're looking for, no?

No need to mess with lookaheads here.

answered Feb 07 '22 at 14:54

rzwitserloot


Yes, this solution seems to work for the not-a-letter requirement. Maybe I should have been clearer that I started with that, rather than only mentioning the number requirement in the text. That kinda necessitates lookahead if numbers can also appear in the string, no? My apologies for not including this more specifically. – MiB Feb 07 '22 at 15:00
Trivially, `[^\\p{L}\\p{N}]` - you need to stop this love affair with lookahead. Lookahead is for, well, lookahead. That's not what you're doing. – rzwitserloot Feb 07 '22 at 15:54
Yes, that does the job, as does this: `[\\P{L}&&\\P{N}]`. When I tried that in the Regexlab app I had to add a "+": `[\P{L}&&\P{N}]+`. I wonder why that is. I don't know the app well, I just start there sometimes. – MiB Mar 28 '22 at 19:14
This great suggestion is what I use now with similar problems. Thanks @rzwitserloot! – MiB Mar 28 '22 at 21:23
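
For anyone landing here later, a minimal sketch that puts the two character-class forms from these comments side by side and wraps one of them in a password check. The class, method, and sample string are made up for illustration, and non-ASCII letters are written as \u escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecialCharRule {
    // Negated class: any character that is neither a letter nor a number.
    static final Pattern NEGATED = Pattern.compile("[^\\p{L}\\p{N}]");
    // Intersection form: not a letter AND not a number.
    static final Pattern INTERSECTED = Pattern.compile("[\\P{L}&&\\P{N}]");

    // Hypothetical helper: true if the password has at least one special character.
    static boolean hasSpecial(String password) {
        return NEGATED.matcher(password).find();
    }

    // Concatenates every match of the pattern, so two patterns can be compared.
    static String collect(Pattern p, String s) {
        StringBuilder sb = new StringBuilder();
        Matcher m = p.matcher(s);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "ASKJ_-.;,DSJ\u00C4\u00D6\u00C5!#%&/()=?<>\u00FC\u00E9 42";
        System.out.println(collect(NEGATED, sample));
        // Both forms are expected to pick out the same characters.
        System.out.println(collect(NEGATED, sample).equals(collect(INTERSECTED, sample)));
        System.out.println(hasSpecial("abc123"));
        System.out.println(hasSpecial("abc 123")); // a space also counts as special
    }
}
```

For the "at least one special character" rule, find() is enough; a trailing + only changes whether each match is a single character or a run of them, not whether a match exists. Keep in mind that this definition is wide: whitespace, control characters, and combining marks are neither letters nor numbers, so they count as special too.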
