
I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:

private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");

which, if I'm reading the Pattern documentation correctly, should match any of the Unicode dashes or the ASCII dash when it is surrounded on both sides by whitespace. I'm using the pattern as follows:

String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);

No joy. For the sample input below, the dash is not detected, and titleSegmentSeparator.matcher(sectionTitle).find() returns false!

To make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows: each character is followed by the output of (int)char, which should be its Unicode code point, no?

Sample input:

Study Summary (1 of 10) – Competition

S(83)t(116)u(117)d(100)y(121) (32)S(83)u(117)m(109)m(109)a(97)r(114)y(121) (32)((40)1(49) (32)o(111)f(102) (32)1(49)0(48))(41) (32)–(8211) (32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)

It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?

tchrist
Alterscape
  • From the docs: "the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014." That is, you can remove the double \\ in your expression. – aioobe Jun 15 '10 at 13:29
  • @aioobe: What an enormous coincidence that the Java docs have used exactly the one character as an example that this question is about. Or did you modify the quote? – Tim Pietzcker Jun 15 '10 at 13:51

1 Answer


You're mixing decimal (8211) and hexadecimal (0x8211).

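\x and \u both expect a hexadecimal number, so you'd need \u2013 to match the en-dash in your sample (decimal 8211) and \u2014 to match the em-dash (decimal 8212), not \u8211 and \u8212. Likewise, the normal hyphen is \x2D (decimal 45); \x45 matches the letter E.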
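As a rough sketch of that fix, the pattern from the question could be rewritten with hexadecimal escapes. The class name HexDashSplit and the method splitTitle are made up for illustration. Decimal 8214 is left out on the assumption that it was not wanted, since it is U+2016 (DOUBLE VERTICAL LINE) rather than a dash.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class HexDashSplit {
    // Hexadecimal escapes for the decimal code points in the question:
    //   45   -> x2D    (hyphen-minus)
    //   8211 -> U+2013 (en dash)
    //   8212 -> U+2014 (em dash)
    //   8213 -> U+2015 (horizontal bar)
    static final Pattern SEPARATOR =
            Pattern.compile("\\s(?:\\x2D|\\u2013|\\u2014|\\u2015)\\s");

    // Hypothetical helper: splits a title on a whitespace-surrounded dash.
    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        // Converts the decimal value from the debug output to hexadecimal.
        System.out.println(Integer.toHexString(8211)); // prints 2013

        String sample = "Study Summary (1 of 10) \u2013 Competition";
        // prints [Study Summary (1 of 10), Competition]
        System.out.println(Arrays.toString(splitTitle(sample)));
    }
}
```

Integer.toHexString is a quick way to turn the (int)char values from the debug output into the digits that \u expects.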

But why not simply use the Unicode property "Dash punctuation"?

As a Java string: "\\s\\p{Pd}\\s"
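A minimal sketch of how that property-based pattern might be used with Pattern.split(); the class name PdDashSplit and the method splitTitle are hypothetical. Note that \p{Pd} covers only the "Dash punctuation" general category, so a character such as U+2212 (MINUS SIGN), which is a math symbol, is not treated as a separator.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PdDashSplit {
    // \p{Pd} matches any character in the Unicode general category
    // "Dash punctuation": hyphen-minus, en dash, em dash, and others.
    static final Pattern SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    // Hypothetical helper: splits a title on a whitespace-surrounded dash.
    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        // ASCII hyphen-minus, en dash (U+2013), and em dash (U+2014).
        System.out.println(Arrays.toString(splitTitle("foo - bar")));
        System.out.println(Arrays.toString(splitTitle("foo \u2013 bar")));
        System.out.println(Arrays.toString(splitTitle("foo \u2014 bar")));
    }
}
```

Each of the three calls should print [foo, bar].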

Tim Pietzcker
  • Alas, Java doesn’t support the Unicode `Dash` property in its regexes, which includes things like the MINUS SIGN, which is of type Symbol. – tchrist Mar 29 '12 at 18:50