I'm having the following problem with regex: I've written a program that reads words from some text (txt) files and writes them to another file, one word per line.

Everything works fine unless a word contains a special character such as ľščťžýáíé. The regex then drops that character and splits the word at its position.

For example:
Input:

I am Jožo.

Output:

I
am
Jo
o

Here's a snippet of the code:

while( (line = br.readLine())!= null ){ 
  Pattern p = Pattern.compile("[\\w']+");
  Matcher m = p.matcher(line);
}
anubhava
DRastislav
  • Try this link, http://stackoverflow.com/questions/2276200/changing-default-encoding-of-python . Do you know what the byte representation of ž is? – JustinDanielson Jul 11 '13 at 20:49

2 Answers

Instead of this regex:

Pattern.compile("[\\w']+")

Use a Unicode-aware one:

Pattern.compile("[\\p{L}']+")

This is because, by default, \\w in Java matches only ASCII letters, digits 0-9 and the underscore.

Another option is to use the modifier

Pattern.UNICODE_CHARACTER_CLASS

Like this:

Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS)
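
To make the difference concrete, here is a minimal sketch that runs both patterns over the sample line from the question. The class name `UnicodeWords` and the helper `words` are made up for illustration, and ž is written as the escape `\u017E` so the example does not depend on the source file's encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the asker's program.
public class UnicodeWords {
    // \p{L} matches any Unicode letter; the apostrophe keeps words like "don't" whole.
    static final Pattern LETTERS = Pattern.compile("[\\p{L}']+");

    // The question's pattern, with \w made Unicode-aware by the flag (Java 7+).
    static final Pattern UNICODE_W =
            Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS);

    // Collects every match of p in line, in order of appearance.
    static List<String> words(Pattern p, String line) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "I am Jo\u017Eo.";  // "I am Jožo."
        System.out.println(words(LETTERS, line));
        System.out.println(words(UNICODE_W, line));
    }
}
```

Both patterns should keep "Jožo" as a single word. They are not fully equivalent, though: \\p{L} covers letters only, while \\w also matches digits and the underscore, so pick whichever fits what you count as a word. Compiling the pattern once, outside the read loop, also avoids recompiling it for every line.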
anubhava

\\w matches only a-z, A-Z, 0-9 and the underscore (the English alphabet plus digits and _). If you want to accept any character except whitespace as part of a word, use \\S.
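
A quick sketch of what that looks like on the question's sample line, assuming the tokens are collected into a list. The class name `NonWhitespaceTokens` and the method `tokens` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the asker's program.
public class NonWhitespaceTokens {
    // \S+ matches a maximal run of non-whitespace characters.
    static final Pattern NON_SPACE = Pattern.compile("\\S+");

    // Returns each whitespace-separated token of the line.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = NON_SPACE.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "I am Jožo." with ž written as a Unicode escape.
        System.out.println(tokens("I am Jo\u017Eo."));
    }
}
```

Keep in mind that \\S treats punctuation as part of the word too, so the last token here is "Jožo." with the trailing period attached. If that is not wanted, it has to be stripped separately.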

Jan Martiška