1

We are debugging some old code and came across this statement. Does anyone know what it's doing?

String value=...
value.toLowerCase(Locale.ENGLISH).split("[^\\w]+");
Eng.Fouad
user646584
  • That won't even compile. `split()` returns a `String[]`, but the result is being assigned to a `String`. – Alan Moore Aug 16 '11 at 01:18
  • It is probably not assigned to a string (see the two occurrences of the 'value' variable); I think the ... just swallowed the semicolon. However, the array resulting from the split is simply dropped. :) – fgysin Aug 16 '11 at 06:16

3 Answers

4

The answer is that it’s doing a lot of things rather naïvely. Why else would they use a negated character class of a word character [^\w] for what can more readably be had in a simple \W? Doesn’t make any sense.
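To make the point about the negated class concrete, here is a minimal sketch showing that `[^\w]+` and `\W+` split a string the same way. The class name, method names, and sample string are made up for illustration and are not from the original code.

```java
import java.util.Arrays;

class NegatedClassDemo {
    // Splits on runs of characters matched by the negated class [^\w].
    static String[] splitNegatedClass(String s) {
        return s.split("[^\\w]+");
    }

    // Splits on runs of characters matched by the shorthand \W.
    static String[] splitShorthand(String s) {
        return s.split("\\W+");
    }

    public static void main(String[] args) {
        String sample = "one, two;three  four_5"; // hypothetical sample input
        // Both lines print the same array of words.
        System.out.println(Arrays.toString(splitNegatedClass(sample)));
        System.out.println(Arrays.toString(splitShorthand(sample)));
    }
}
```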

Plus the locale silliness suggests that they must be afraid they’re in Turkey, since I don’t know any other locale but Turkish and Azeri where there is ever a difference in casing. Normally LATIN CAPITAL LETTER I lowercases to LATIN SMALL LETTER I as you would expect, but in Turkic languages it lowercases to LATIN SMALL LETTER DOTLESS I.
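A small sketch of the casing difference described above, assuming a Turkish locale obtained via `Locale.forLanguageTag("tr-TR")`. The class and method names are hypothetical.

```java
import java.util.Locale;

class TurkishCaseDemo {
    // Lowercases with English rules: capital I becomes i (U+0069).
    static String lowerEnglish(String s) {
        return s.toLowerCase(Locale.ENGLISH);
    }

    // Lowercases with Turkish rules: capital I becomes dotless i (U+0131).
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr-TR"));
    }

    public static void main(String[] args) {
        String sample = "TITLE"; // hypothetical sample input
        System.out.println(lowerEnglish(sample));
        System.out.println(lowerTurkish(sample));
    }
}
```

Passing an explicit locale, as the original statement does, keeps the result independent of the default locale of the machine running the code.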

Even so, it won’t work right for Unicode unless they use the embedded "(?U)" flag only available in Java 7. You can’t make \w and \W play by Unicode rules just by that silly pointless locale thing. You must use the "(?U)", or else, if you are actually compiling the pattern, the UNICODE_CHARACTER_CLASS flag. Both of those need Java 7. Before that, Java is worse than merely useless for handling Unicode with regex charclass shortcuts like that. It’s actually misleading, wrong, and harmful.

Otherwise the dumb thing will think that a regular English word like naïvely has two words separated by a nonword sequence. It is super stupid.
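The following sketch illustrates the problem and both Java 7 fixes mentioned above: the embedded `(?U)` flag and the compile-time constant, which `java.util.regex.Pattern` spells `UNICODE_CHARACTER_CLASS`. The class name, method names, and sample text are assumptions for illustration, and the sample uses the precomposed ï (U+00EF).

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.regex.Pattern;

class UnicodeWordSplitDemo {
    // Same pattern as the original code, compiled with Unicode-aware \w.
    static final Pattern UNICODE_NON_WORD =
            Pattern.compile("[^\\w]+", Pattern.UNICODE_CHARACTER_CLASS);

    // The original statement: \w covers only ASCII letters, digits and underscore.
    static String[] splitAscii(String s) {
        return s.toLowerCase(Locale.ENGLISH).split("[^\\w]+");
    }

    // Embedded (?U) flag (Java 7+): \w follows Unicode rules.
    static String[] splitUnicodeFlag(String s) {
        return s.toLowerCase(Locale.ENGLISH).split("(?U)[^\\w]+");
    }

    // Precompiled pattern with the equivalent flag (Java 7+).
    static String[] splitUnicodeCompiled(String s) {
        return UNICODE_NON_WORD.split(s.toLowerCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        String sample = "na\u00EFvely done"; // hypothetical sample input
        System.out.println(Arrays.toString(splitAscii(sample)));           // "na" and "vely" come out as separate words
        System.out.println(Arrays.toString(splitUnicodeFlag(sample)));     // the word stays whole
        System.out.println(Arrays.toString(splitUnicodeCompiled(sample))); // the word stays whole
    }
}
```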

So in answer to your question, I don’t think it’s doing what its author thinks it’s doing. I guarantee you that it’s broken unless it’s entirely ASCII text. See here for the hellish things that happened before Java 7 and what you had to do to work around them, and see here for some of what Java 7 brings to the table.

Community
tchrist
3

It appears to be splitting the string into words, using runs of non-word characters (matched by [^\w]+) as the delimiters.

BoltClock
  • Rather poorly though. The locale setting suggests they’re worried about letters that English doesn’t have, but the `\w` is complete bollocks with such characters. I think it’s broken. See my answer for why. – tchrist Aug 16 '11 at 01:12
0

Split the string on each group of non-word characters. A word character is a letter, digit, or underscore. The string splits on groups of anything else.
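As a rough illustration of that behavior, here is a sketch that wraps the statement from the question in a method and returns the array instead of discarding it. The class name, method name, and sample inputs are hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;

class SplitWordsDemo {
    // Lowercases with English rules, then splits on runs of non-word characters.
    static String[] words(String value) {
        return value.toLowerCase(Locale.ENGLISH).split("[^\\w]+");
    }

    public static void main(String[] args) {
        // Punctuation and spaces act as delimiters; the underscore does not.
        System.out.println(Arrays.toString(words("Hello, World! foo_bar 42")));
        // A delimiter at the very start leaves an empty first element.
        System.out.println(Arrays.toString(words("...wait")));
    }
}
```

Note that trailing delimiters do not produce empty strings at the end of the array, but a leading delimiter does produce one at the start, so callers may want to skip empty elements.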

Paul