-2

Can anyone recommend a Regex that would match on the following rules:

  • Upper case or a space

The strings I want to match look like this:

LONDON 10 Downing St, London

or this

NEW YORK 2859 Broadway, New York, NY 10025

I want to be able to match the words LONDON and NEW YORK when I pass in each line.

P.S. I am doing this in Java

Ankur

3 Answers

5
  • Beginning of the string: ^
  • Uppercase letter: \p{Lu}
  • Space: a literal space character
  • Combining the two: [\p{Lu} ]
  • Any number of the preceding token: *
  • Assertion that the match ends at a word boundary (for non-ASCII letters this needs Java 7 and the UNICODE_CHARACTER_CLASS flag to work reliably): \b

Your regex, therefore, is

^[\p{Lu} ]*\b

Don't forget to double the backslashes to comply with Java's string escaping rules:

In Java 7:

Pattern regex = Pattern.compile("^[\\p{Lu} ]*\\b", Pattern.UNICODE_CHARACTER_CLASS);

In Java 6 and below:

Pattern regex = Pattern.compile("^[\\p{Lu} ]*(?<=\\p{Lu})");
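
For anyone who wants to try this on the two sample lines from the question, here is a minimal sketch. The class and method names (`UppercasePrefixDemo`, `leadingUppercase`) are made up for illustration. It uses the lookbehind variant because, on these particular inputs, `\b` also holds between the space and the digit that follows it, so the `\b` form would likely keep the trailing space in the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class UppercasePrefixDemo {
    // Lookbehind variant; every regex backslash is doubled for the Java string literal.
    static final Pattern PREFIX = Pattern.compile("^[\\p{Lu} ]*(?<=\\p{Lu})");

    // Returns the leading run of uppercase letters and spaces (ending on a letter),
    // or null if the line does not start with an uppercase letter or space run.
    static String leadingUppercase(String line) {
        Matcher m = PREFIX.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(leadingUppercase("LONDON 10 Downing St, London"));
        System.out.println(leadingUppercase("NEW YORK 2859 Broadway, New York, NY 10025"));
    }
}
```

The lookbehind forces the engine to backtrack off the trailing space, so the match ends on the last uppercase letter.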
Tim Pietzcker
  • Does `\b` not work in Java 6 and below? I didn't know that. – arshajii Aug 04 '13 at 15:46
  • 2
    @arshajii: [It only matches at ASCII word boundaries](http://stackoverflow.com/q/4304928/20670). – Tim Pietzcker Aug 04 '13 at 15:48
  • 2
    If you're not coding for/in Java 7, use `(?<=\p{Lu})` instead of `\b`. That's a [positive lookbehind assertion](http://www.regular-expressions.info/lookaround.html) making sure that the previous character is an uppercase letter. – Tim Pietzcker Aug 04 '13 at 15:55
  • 2
    Tim, it may be that even in Java 7 you will need `(?u)` or `Pattern.UNICODE_CHARACTER_CLASS` to get the `\b` to work with non-ASCII. I’d check to make sure. I would have to look up the mailing list discussions from 2–3 years ago to see what the resolution was to the `\b` mess. – tchrist Aug 04 '13 at 16:50
  • @tchrist: [The docs](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS) seem to suggest that you do need to set that flag explicitly (or use `(?U)` with an uppercase `U`). Thanks! – Tim Pietzcker Aug 04 '13 at 17:16
  • @TimPietzcker Oops, right: yes, it’s an uppercase `(?U)`. – tchrist Aug 04 '13 at 17:33
1

You can use this pattern:

^[A-Z ]+

This will match one or more upper case Latin letters or spaces from the beginning of the string.

You can easily modify this to avoid capturing trailing spaces:

^[A-Z ]*[A-Z]
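
A quick sketch comparing the two patterns above in Java, assuming ASCII-only input. The class and method names (`AsciiPrefixDemo`, `firstMatch`) are made up for illustration; the brackets in the output are only there to make a trailing space visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class AsciiPrefixDemo {
    // First pattern: may end with a trailing space.
    static final Pattern WITH_TRAILING = Pattern.compile("^[A-Z ]+");
    // Second pattern: the final [A-Z] forces the match to end on a letter.
    static final Pattern TRIMMED = Pattern.compile("^[A-Z ]*[A-Z]");

    // Returns the first match of p in line, or null if there is none.
    static String firstMatch(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "LONDON 10 Downing St, London";
        System.out.println("[" + firstMatch(WITH_TRAILING, line) + "]"); // [LONDON ]
        System.out.println("[" + firstMatch(TRIMMED, line) + "]");       // [LONDON]
    }
}
```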
p.s.w.g
-2

Use this:

^\u+( \u+)*

It matches one or more uppercase characters, optionally followed by groups of a single space and more uppercase characters, so the match never ends with a space.
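
Since `\u` is not a token Java's regex engine accepts on its own, here is a sketch of the same structure with `\p{Lu}` swapped in. That substitution is an assumption on my part, not the pattern as posted, and the names (`WordGroupsDemo`, `leadingWords`) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; \p{Lu} stands in for the \u shorthand (assumption).
class WordGroupsDemo {
    // One uppercase word, then zero or more groups of (single space + uppercase word).
    static final Pattern WORDS = Pattern.compile("^\\p{Lu}+(?: \\p{Lu}+)*");

    // Returns the leading uppercase words, or null if the line does not start with one.
    static String leadingWords(String line) {
        Matcher m = WORDS.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(leadingWords("LONDON 10 Downing St, London"));
        System.out.println(leadingWords("NEW YORK 2859 Broadway, New York, NY 10025"));
    }
}
```

Because each space must be followed by at least one uppercase letter, the match cannot end on a space, and no lookbehind or backtracking trick is needed.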

Jongware
  • 2
    `\u` is not a valid regex token in any regex flavor I know, and most certainly not in Java (unless you can show me the documentation that says so). – Tim Pietzcker Aug 04 '13 at 15:47
  • @tim-pietzcker: how very odd. It's available in [InDesign](http://help.adobe.com/en_US/indesign/cs/using/WSFB3603CC-8D84-48d8-9F77-F3E0644CB0B6a.html#WSa285fff53dea4f8617383751001ea8cb3f-6f59a), and InDesign's GREP uses Boost. Never thought Adobe would invent such a useful shortcut! Seems you have to use the POSIX shorthand or the explicit range `[A-Z]` after all. – Jongware Aug 04 '13 at 16:06
  • It would be nice if Regex could be more intuitive. – Ankur Aug 04 '13 at 16:29
  • @Ankur What does “intuitive” mean? – tchrist Aug 04 '13 at 16:51
  • 1
    @TimPietzcker Well, yes and no and maybe. There are places where `\u` means to convert the next code point to titlecase, but that’s not technically at the regex level. It derives from the popularity of being able to say `:%s/[a-z]+/\u\L&/g` in the `vi` editor: notice it falls on the replacement side not the search side, which means that it is really a string not a pattern. Perl uses it that way, where it compiles into a call to the `ucfirst` function, which in turn maps to the Unicode titlecase version of that next code point. – tchrist Aug 04 '13 at 16:54