
I want to match words made up of Unicode letters (\p{L}).

The Scala REPL returns false for the following expression, while Java returns true (which is the correct behaviour):

java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()

Both Java and Scala are running in JRE 1.7:

System.getProperty("java.version") returns "1.7.0_60-ea"

What could be the reason for that?

pvorb
  • See http://stackoverflow.com/questions/5315330/matching-e-g-a-unicode-letter-with-java-regexps for why your regex isn't quite sufficient. You need `\p{L}\p{M}*` – The Archetypal Paul Feb 17 '14 at 20:15

2 Answers


The interpreter is probably using an incompatible default character encoding. For example, here's my output:

scala> System.getProperty("file.encoding")
res0: String = UTF-8

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("ä").matches()
res1: Boolean = true

So the solution is to run scala with -Dfile.encoding=UTF-8. Note, however, what this (somewhat old) blog post says:

The only reliable way we've found for setting the default character encoding for Scala is to set $JAVA_OPTS before running your application:

$ JAVA_OPTS="-Dfile.encoding=utf8" scala [...]

Just trying to set scala -Dfile.encoding=utf8 doesn't seem to do it. [...]
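To illustrate why the encoding matters for this regex, here is a minimal Java sketch of one plausible mechanism: the UTF-8 bytes of "ä" get decoded with a single-byte charset, so the string the regex sees is two characters long. I can't confirm this is exactly what the REPL does internally. The class name and the `misdecode` helper are made up for the example, and ISO-8859-1 stands in for a code page like Cp1252 only because it is guaranteed to be available on every JVM.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class EncodingMismatchDemo {
    // Same pattern as in the question: exactly one Unicode letter.
    static final Pattern LETTER = Pattern.compile("\\p{L}");

    // Hypothetical helper: encode as UTF-8, then decode with a single-byte
    // charset, the way a misconfigured console or interpreter might.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "ä" written as an escape so the source file encoding cannot interfere.
        String original = "\u00e4";
        String garbled = misdecode(original);

        System.out.println(LETTER.matcher(original).matches()); // true
        System.out.println(garbled.length());                   // 2
        System.out.println(LETTER.matcher(garbled).matches());  // false
    }
}
```

Checking the length (or the code points) of the string inside the REPL is a quick way to tell whether the input was garbled before the regex ever ran.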


This wasn't the case here, but it can also happen: your "ä" could be an "a" followed by a combining diaeresis (U+0308), e.g.:

scala> println("a\u0308")
ä

scala> java.util.regex.Pattern.compile("\\p{L}").matcher("a\u0308").matches()
res1: Boolean = false

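This fails because \p{L} matches a single code point, while matches() requires the pattern to cover the whole input, which here is two code points: the letter "a" and the combining mark. This can be a problem on systems that produce diacritics as Unicode combining characters (I think OS X is one, at least in some versions). For more info, see the question linked in Paul's comment.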
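A minimal Java sketch of two ways to handle the decomposed form: the `\p{L}\p{M}*` pattern suggested in the comment on the question, or normalizing the input to NFC with `java.text.Normalizer` before matching. The class and helper names are made up for the example.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class CombiningMarkDemo {
    // One letter only, as in the question.
    static final Pattern LETTER = Pattern.compile("\\p{L}");
    // One letter followed by any number of combining marks.
    static final Pattern LETTER_WITH_MARKS = Pattern.compile("\\p{L}\\p{M}*");

    // Compose base letter + combining marks into precomposed characters where possible.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "a\u0308"; // "a" + combining diaeresis

        System.out.println(LETTER.matcher(decomposed).matches());            // false
        System.out.println(LETTER_WITH_MARKS.matcher(decomposed).matches()); // true
        System.out.println(LETTER.matcher(toNfc(decomposed)).matches());     // true
    }
}
```

Normalizing only helps when a precomposed character exists for the sequence; the `\p{M}*` pattern also covers sequences that have no precomposed form.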

mikołak
  • That's it. `scala> System.getProperty("file.encoding")` gives me `res0: String = Cp1252` since I'm on Windows. Thank you for the information. – pvorb Feb 17 '14 at 21:39
  • @pvorb: thanks for the info, I'll edit the post to highlight the "main" solution accordingly. – mikołak Feb 17 '14 at 21:53

You can also "Enable the Unicode version of Predefined character classes and POSIX character classes", as described in the documentation for java.util.regex.Pattern and UNICODE_CHARACTER_CLASS.

This means you can use character classes such as '\w' to match Unicode characters like this:

"(?U)\\w+".r.findFirstIn("pässi")

In the regexp above, the '(?U)' part is an embedded flag expression that turns on the UNICODE_CHARACTER_CLASS flag.

This flag is supported starting from Java 7.
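For comparison, here is a small Java sketch showing the same flag set in two ways: through the embedded `(?U)` expression and through the `Pattern.UNICODE_CHARACTER_CLASS` constant passed to `Pattern.compile`. The class name and the `firstWord` helper are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Default \w only covers [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // Flag passed as a compile-time argument.
    static final Pattern UNICODE_WORD_FLAG =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
    // Same flag, enabled inside the pattern itself.
    static final Pattern UNICODE_WORD_EMBEDDED = Pattern.compile("(?U)\\w+");

    // Returns the first match in the input, or null if there is none.
    static String firstWord(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "p\u00e4ssi"; // "pässi"

        System.out.println(firstWord(ASCII_WORD, input));            // p
        System.out.println(firstWord(UNICODE_WORD_FLAG, input));     // pässi
        System.out.println(firstWord(UNICODE_WORD_EMBEDDED, input)); // pässi
    }
}
```

The embedded form is handy from Scala, where `"...".r` gives no direct way to pass flags.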

marko