Validate a string that can contain any characters, but only letters from a specific alphabet/script

Question

I have this string

String s = "Some text, some text!";

I need to check this string: if it contains characters from another language, such as Hebrew or Russian, return false; if it contains only English characters (optionally with spaces and punctuation), return true. Of course, a string like String s = ", , ." must return false.

I tried this code

Pattern pEng = Pattern.compile("\\p{Alpha}+\\p{Space}*\\p{Punct}*\\p{Digit}*");
pEng.matcher(s).matches()

but it returns false.

What am I doing wrong? I have already spent a lot of time looking for an answer. Can anyone help?

Maybe you are right. I read this question too quickly, and then wrote this: `Pattern pEng2 = Pattern.compile("[\\p{Alpha}[\\p{Space}\\p{Punct}\\p{Digit}]*]+");` It seems to work for me now — Stanislav Rymar, Oct 12 '18 at 13:14
@StanislavRymar That's because your original pattern didn't allow for any alphanumeric characters after the punctuation, btw. — OhleC, Oct 12 '18 at 13:15
@trilogy Unfortunately, the code in the comment returns true for spaces and punctuation without any text :( — Stanislav Rymar, Oct 12 '18 at 13:20
If yes, see [this demo](https://ideone.com/Algd4R) with `.matches("[\\p{ASCII}&&[^A-Za-z]]*[A-Za-z]\\p{ASCII}*")`. To only match printable ASCII with at least 1 ASCII letter: `.matches("[ -~&&[^A-Za-z]]*[A-Za-z][ -~]*")` (see [demo](https://ideone.com/rVc3NV)) — Wiktor Stribiżew, Oct 12 '18 at 13:36
@WiktorStribiżew It seems to work for me; I need to run some tests. Thanks! — Stanislav Rymar, Oct 12 '18 at 13:47
**BTW** If you are actually looking for ***FOREIGN LANGUAGE UTF-8***, then **IT IS NOT OK** to use the simple A-Za-z notation that regular expressions provide. I host two foreign-language translation web-sites: SpanishNewsBoard.com and ChineseNewsBoard.com - you need to explicitly define which UTF-8 characters you are looking for. In Spanish, accented vowels are common, and I use this regular expression often: Pattern.compile("[ÁÉÍÓÚÝÜÑáéíóúýüñ]"). If **Hebrew, Russian, etc** (Cyrillic) alphabets use different ASCII or UTF-8 characters, you must explicitly name them in your regex pattern. — , Oct 12 '18 at 17:55
**NOTE** The **DUPLICATE** suggestion answer to this question does not answer the question at all - since it explicitly requests foreign-language characters - which is what I've sort of programmed for over 2 years now. — , Oct 12 '18 at 18:00

Wiktor Stribiżew · Accepted Answer · 2018-10-13T15:33:34.953

To match a string that only contains ASCII chars and has at least one ASCII letter, you may use

s.matches("[\\p{ASCII}&&[^A-Za-z]]*[A-Za-z]\\p{ASCII}*")

See this Java demo: https://ideone.com/Algd4R

If you do not want to allow control chars in the input, use a variation of the pattern:

s.matches("[ -~&&[^A-Za-z]]*[A-Za-z][ -~]*")

See this Java demo: https://ideone.com/rVc3NV

Note that .matches requires a full string match, hence there is no need to add ^ and $ / \z anchors around the pattern.
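The difference between a full match and a partial match can be sketched as follows. This is only an illustration: the class name FullMatchDemo and its two helper methods are hypothetical, and the Cyrillic letters are written as Unicode escapes just to keep the source ASCII-only.

```java
import java.util.regex.Pattern;

class FullMatchDemo {
    // Printable-ASCII pattern from the answer, compiled once.
    static final Pattern PRINTABLE =
            Pattern.compile("[ -~&&[^A-Za-z]]*[A-Za-z][ -~]*");

    // matches(): the pattern must cover the entire input.
    static boolean fullMatch(String s) {
        return PRINTABLE.matcher(s).matches();
    }

    // find(): the pattern may match any substring of the input.
    static boolean partialMatch(String s) {
        return PRINTABLE.matcher(s).find();
    }

    public static void main(String[] args) {
        // "Some Text " followed by three Cyrillic letters (U+044B).
        String mixed = "Some Text \u044B\u044B\u044B";
        System.out.println(fullMatch(mixed));    // false
        System.out.println(partialMatch(mixed)); // true
    }
}
```

With find(), the ASCII prefix alone is enough for a hit, which is why validation code would need explicit anchors there.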

Pattern details

[ -~&&[^A-Za-z]]* - 0 or more printable ASCII chars except ASCII letters (&&[^...] is a character class subtraction; it is here to make the pattern work faster and more efficiently)
[A-Za-z] - an ASCII letter (=\p{Alpha})
[ -~]* - 0 or more printable ASCII chars.

The \p{ASCII} POSIX character class matches any ASCII char, i.e. [\x00-\x7F].
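A small sketch comparing the two variants side by side. The class and method names (AsciiCheck, asciiWithLetter, printableAsciiWithLetter) are made up for illustration; only the two patterns come from the answer.

```java
class AsciiCheck {
    // Any ASCII chars (control chars included), with at least one ASCII letter.
    static boolean asciiWithLetter(String s) {
        return s.matches("[\\p{ASCII}&&[^A-Za-z]]*[A-Za-z]\\p{ASCII}*");
    }

    // Printable ASCII only (space through tilde), with at least one ASCII letter.
    static boolean printableAsciiWithLetter(String s) {
        return s.matches("[ -~&&[^A-Za-z]]*[A-Za-z][ -~]*");
    }

    public static void main(String[] args) {
        System.out.println(asciiWithLetter("Some text, some text!"));   // true
        System.out.println(asciiWithLetter(", , ."));                   // false
        // A tab is an ASCII control char: accepted by the first variant only.
        System.out.println(asciiWithLetter("tab\there"));               // true
        System.out.println(printableAsciiWithLetter("tab\there"));      // false
    }
}
```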

Additional info

If you need to match a string whose letters all come from a certain script/alphabet, while any non-letter chars are allowed, you may use

s.matches("\\P{L}*(?:[A-Za-z]\\P{L}*)+")

Here, [A-Za-z] is for English; for Russian, you would use [а-яА-ЯёЁ].
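A rough sketch of both variants, checked against inputs similar to the ones in the comments. The class and method names are hypothetical. The Russian letter class is the same [а-яА-ЯёЁ] range, written with Unicode escapes so the source stays ASCII-only.

```java
class ScriptLetters {
    // Every letter must be an ASCII letter; non-letters are unrestricted.
    static boolean onlyEnglishLetters(String s) {
        return s.matches("\\P{L}*(?:[A-Za-z]\\P{L}*)+");
    }

    // Every letter must be a Russian letter:
    // U+0430-U+044F (lower case), U+0410-U+042F (upper case), U+0451 and U+0401 (yo).
    static boolean onlyRussianLetters(String s) {
        return s.matches(
                "\\P{L}*(?:[\u0430-\u044F\u0410-\u042F\u0451\u0401]\\P{L}*)+");
    }

    public static void main(String[] args) {
        String russianWord = "\u0442\u0435\u043A\u0441\u0442"; // a five-letter Cyrillic word
        System.out.println(onlyEnglishLetters(" Some text, some!"));          // true
        System.out.println(onlyEnglishLetters("SomeText12"));                 // true
        System.out.println(onlyEnglishLetters(", . "));                       // false
        System.out.println(onlyEnglishLetters("Some Text " + russianWord));   // false
        System.out.println(onlyRussianLetters(russianWord + ", 12"));         // true
        System.out.println(onlyRussianLetters(russianWord + ", some text"));  // false
    }
}
```

The trailing + on the group is what rejects inputs without any letter, such as ", . ".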

Now, say you want to only match a string whose letters can only be Hebrew letters. Since \p{InHebrew} matches the whole Hebrew Unicode block, not just letters, you would use an intersection of this class and the letter class \p{L}, [\p{InHebrew}&&[\p{L}]]:

str.matches("\\P{L}*(?:[\\p{InHebrew}&&[\\p{L}]]\\P{L}*)+")
                       ^^^^^^^^^^^^^^^^^^^^^^^^^
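To illustrate why the intersection matters, here is a sketch that compares it with a plain \p{InHebrew} in the letter position. Both method names are hypothetical, and the second one exists only for comparison. Hebrew chars are given as Unicode escapes to keep the source ASCII-only.

```java
class HebrewLetters {
    // Letter position restricted to Hebrew-block chars that are also letters.
    static boolean onlyHebrewLetters(String s) {
        return s.matches("\\P{L}*(?:[\\p{InHebrew}&&[\\p{L}]]\\P{L}*)+");
    }

    // Same shape without the intersection (for comparison only):
    // any Hebrew-block char, letter or not, satisfies the "letter" position.
    static boolean anyHebrewBlockChar(String s) {
        return s.matches("\\P{L}*(?:\\p{InHebrew}\\P{L}*)+");
    }

    public static void main(String[] args) {
        String word = "\u05E9\u05DC\u05D5\u05DD"; // four Hebrew letters
        String point = "\u05B0";                   // Hebrew point sheva, a mark, not a letter
        System.out.println(onlyHebrewLetters(word + ", 12!")); // true
        System.out.println(onlyHebrewLetters(word + " abc"));  // false
        System.out.println(onlyHebrewLetters(point));          // false
        System.out.println(anyHebrewBlockChar(point));         // true
    }
}
```

Without the intersection, a string made only of Hebrew points or punctuation would pass even though it contains no letter.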

edited Oct 13 '18 at 15:33

answered Oct 12 '18 at 13:52

Wiktor Stribiżew


Can I adapt this pattern to another language, e.g. by replacing [A-Za-z] with \\p{InHebrew}? – Stanislav Rymar Oct 12 '18 at 14:04
@StanislavRymar Please explain in words what kind of string (what it should consist of, any obligatory chars?) you want to match. Provide one or two examples that should be valid and 1 or 2 invalid examples. – Wiktor Stribiżew Oct 12 '18 at 14:08
Another idea: `s.matches("\\P{L}*(?:[A-Za-z]\\P{L}*)+")` - it matches 0+ non-letter chars, and then 1+ repetitions of an ASCII letter followed with 0+ non-letters. – Wiktor Stribiżew Oct 12 '18 at 14:15
I have to check whether a string consists of words of one particular language, ignoring punctuation, numbers and spaces. If I need to check English: `" Some text, some!" = true "SomeText12" = true ", . " = false "Some Text ыыы" = false`. If I need to check Russian: `" Какой-то текст, текст12" = true " Какой-то текст, some text" = false " , -." = false` – Stanislav Rymar Oct 12 '18 at 14:48
@StanislavRymar You can't easily match a certain language alphabet, you may only match certain scripts, or you will have to build your own character classes. Then, use `s.matches("\\P{L}*(?:[A-Za-z]\\P{L}*)+")` for English, `s.matches("\\P{L}*(?:[а-яА-ЯёЁ]\\P{L}*)+")` for Russian, etc. – Wiktor Stribiżew Oct 12 '18 at 14:57
You need to review the UTF-8 characters you are looking for on one of those UTF-8 language-character charts. I have done so for Spanish, and (laboriously) for Mandarin Chinese. I have never worked with or read Russian, so I cannot provide a list of the Cyrillic or Hebrew alphabet, but you would need to consult what characters would indicate that a String contains Cyrillic or Hebrew. **This type of information is not included in any Regular Expression compiler** that I know of, but there are probably ways to find them for Java. The best way would be to see UTF-8, and enter them manually. – Oct 12 '18 at 17:58
@RalphTorello I already provided the Russian language letter range, `[а-яА-ЯёЁ]`. I also wrote a lot of [other European languages regexps](https://stackoverflow.com/a/30798598/3832970). – Wiktor Stribiżew Oct 12 '18 at 18:05
Yeah, I **think** you did... I didn't even recognize it in your comment!! I study Chinese, and it looked like the English Alphabet (which is the whole point of the question, I think). Mostly, I was upset that they flagged this as a "duplicate question" - but that's only because I sit and agonize about foreign-language UTF-8 chars all the time - and they are **definitely not the same as regular-expression a-zA-Z** Thanks. :) – Oct 12 '18 at 18:25
@StanislavRymar Could you please let me and Ralph know if you need more help with this question? We can talk in Russian in chat, if you like. – Wiktor Stribiżew Oct 12 '18 at 18:49
@WiktorStribiżew All OK! I changed the range in your regex to the classes \\p{InHebrew}, \\p{InCyrillic} and \\p{Alpha}, and everything works great! Thank you very much, this helped; quick tests showed that everything fits! – Stanislav Rymar Oct 13 '18 at 12:44
