
I want to match non-Latin characters. As I understand it, a.matches("[\\x8A-\\xFF]+") should return true for the string below, but it returns false.

String a = "Ž";
if (a.matches("[\\x8A-\\xFF]+"))
{

}
Romi
  • "Ž" is not in that range. – Maroun May 28 '15 at 07:36
  • "Ž" is 8E which is in the range – Romi May 28 '15 at 07:37
  • You mean you want to disregard from the diacritics? – aioobe May 28 '15 at 07:40
  • False. In Unicode, "Ž" is 0x017D - http://www.unicode.org/charts/PDF/U0100.pdf – Stephen C May 28 '15 at 07:40
  • What is your intent? To match all the characters in your range and also `Ž`? Then just add it to the character class `"[\\x8A-\\xFF\\u017D]+"`. If you want to find the extended characters only, you have an answer already. – Wiktor Stribiżew May 28 '15 at 07:43
  • @Romi *"Ž is 8E"* holds in an extended Latin charset, not in Unicode. – Alex Salauyou May 28 '15 at 07:46
  • @SashaSalauyou, do you have a reference for that? – aioobe May 28 '15 at 07:48
  • You're probably confusing the character's code in some code page/charset with its Unicode code point. Read this: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) – phuclv May 28 '15 at 07:50
  • I meant a reference for the extended latin charset that says Ž is 8E. – aioobe May 28 '15 at 07:52
  • @aioobe you're right, not 8E, but AE in ISO Latin-2 charset: https://msdn.microsoft.com/ru-ru/goglobal/cc305168.aspx. But it still "fits" the range provided in OP's example. – Alex Salauyou May 28 '15 at 07:57
  • @aioobe I wouldn't be surprised if there is some Czech or Polish custom charset where Ž is exactly 8E (for Cyrillic, for example, we have 5 or more different charsets) – Alex Salauyou May 28 '15 at 08:08
  • Yes. This was what I was curious about. If it was clear where Ž = 8E came from, it would be easier to sort out where the confusion came from and to provide a good answer. – aioobe May 28 '15 at 08:10
  • @aioobe look: https://www.microsoft.com/typography/unicode/1250.gif. It is Windows-1250 (http://en.wikipedia.org/wiki/Windows-1250) – Alex Salauyou May 28 '15 at 08:15
  • http://www.ascii-code.com/ Here I found Ž = 8E – Romi May 28 '15 at 10:36
  • @Romi When you have a String in Java, you are working with Unicode characters (although you still need to be aware that a Java String is UTF-16). Since Java 5, Pattern always matches in terms of Unicode code points. How the character is encoded in some other encoding is irrelevant once you hold a String; that was taken care of when the byte stream was decoded into the String. – nhahtdh May 28 '15 at 13:09

1 Answer

Judging from your title:

Regex to match non-latin char with ASCII 0-31 and 128-255

it seems you're after all characters except those in range 32-127 and you're surprised Ž doesn't match.

If this is correct, I suggest you use the expression [^\x20-\x7F] ("all characters except those in range 32-127"). This does match Ž.
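A minimal Java sketch of that expression, assuming the goal really is "anything outside 32-127". The class and method names are hypothetical, and each backslash has to be doubled inside a Java string literal:

```java
import java.util.regex.Pattern;

public class NonAsciiMatch {
    // One or more characters, none of which lies in the range 0x20-0x7F.
    private static final Pattern OUTSIDE_32_127 = Pattern.compile("[^\\x20-\\x7F]+");

    // True if the whole string consists of characters outside 0x20-0x7F.
    public static boolean isOutside32To127(String s) {
        return OUTSIDE_32_127.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isOutside32To127("\u017D")); // Z with caron: prints true
        System.out.println(isOutside32To127("Z"));      // plain ASCII Z: prints false
    }
}
```

The string is written as "\u017D" only so that the example does not depend on the source file encoding.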

(An exact translation of the regex in your title would look like [\x00-\x1F\x80-\xFF] but this still doesn't match Ž as described below.)

Why your initial attempt didn't work:

A \xNN escape matches a character by its Unicode code point. The code point of Ž is 0x017D, so it falls outside the range \x8A-\xFF.

When you say "Ž" is 8E you're most likely seeing a value from an extended ASCII table, and these are not the values that the Java regex engine works with.
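To check this yourself, here is a small sketch (class and method names are made up for illustration) that prints the code point Java actually holds for Ž and tries the original character class, followed by the same class with \u017D added as suggested in the comments on the question:

```java
public class CodePointCheck {
    // Returns the Unicode code point of the first character of s.
    public static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        String a = "\u017D"; // Z with caron
        System.out.printf("U+%04X%n", firstCodePoint(a));        // prints U+017D
        System.out.println(a.matches("[\\x8A-\\xFF]+"));         // original class: prints false
        System.out.println(a.matches("[\\x8A-\\xFF\\u017D]+"));  // class plus U+017D: prints true
    }
}
```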

aioobe
  • if (a.matches("[\\x8A-\\xFF]+")) is not matching; it returns false. – Romi May 28 '15 at 07:38
  • @Romi Exactly, you should negate the character group. – Maroun May 28 '15 at 07:39
  • @aioobe Then what would be the regexp to match 0-31 and 128-159? I want to include extended characters like "Ž" too. – Romi May 28 '15 at 10:42
  • That would be `[\x00-\x1F\x80-\x9F]`, but as I mention in my answer, regular expressions don't work with ASCII values. So it's not clear that that expression will work for you. – aioobe May 28 '15 at 11:11
  • @Romi, if you tell me exactly what characters you want to match (and I mean characters, not ASCII-values) then I can help you put together a regular expression. (If you just provide extended ASCII values I don't know which code page you're talking about (8E can in fact mean different things depending on which code page you're using) and regular expressions don't work together with ASCII values outside the range 0-127.) – aioobe May 28 '15 at 23:20
  • @aioobe: on the page I linked, Ž is 8E, yet it still does not match the given regular expression. – Romi May 29 '15 at 06:39
  • Ah, right. But check [this page](http://www.ascii-codes.com/) (it lists Ä for 8E), or [this page](http://academic.evergreen.edu/projects/biophysics/technotes/program/ascii_ext-mac.htm) (it lists é for 8E). So which one should we look at when we write regexps? Well, none of those, because those are ASCII tables. We should look at a table showing unicode codepoints, such as [this page](http://en.wikipedia.org/wiki/List_of_Unicode_characters). As you can see on that page, Ž is in fact 0x017D. – aioobe May 29 '15 at 07:38
  • Now, here's a potential follow-up question: *But if I write the byte 0x8E in a file on my system, it will be printed as Ž!* It may be the case that your default encoding indeed maps 0x8E to Ž. However, when Java reads 0x8E and translates it to Ž, it will forget that it once was 0x8E on disk and it will *just* remember the symbol Ž. So when you try to match it, 0x8E isn't around anymore. Ž is around, and when encountering a `\xNN` expression, it will look at the unicode codepoint for Ž which is 0x017D. – aioobe May 29 '15 at 07:42
  • @Romi, is my answer (and comments) clear? Anything still unresolved? – aioobe Jun 08 '15 at 07:21
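The decoding step described in the comments above can be sketched as follows. This assumes the JVM ships the windows-1252 charset (the common desktop JDKs do, but it is not one of the charsets every Java platform is required to support); the class and method names are hypothetical:

```java
import java.nio.charset.Charset;

public class DecodeByteDemo {
    // Decodes a single byte with the named charset and returns the resulting String.
    public static String decode(int b, String charsetName) {
        return new String(new byte[] {(byte) b}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // In windows-1252 the byte 0x8E is Z with caron, which Java stores as U+017D.
        String cp1252 = decode(0x8E, "windows-1252");
        System.out.printf("U+%04X%n", cp1252.codePointAt(0));
        System.out.println(cp1252.matches("[\\x8A-\\xFF]+"));

        // ISO-8859-1 maps each byte to the code point with the same value,
        // so here 0x8E becomes U+008E (a control character), not Z with caron.
        String latin1 = decode(0x8E, "ISO-8859-1");
        System.out.printf("U+%04X%n", latin1.codePointAt(0));
        System.out.println(latin1.matches("[\\x8A-\\xFF]+"));
    }
}
```

The same byte ends up as two different characters depending on the charset used for decoding, and the regex only ever sees the decoded code point.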