Regex to match some number of upper case characters at the beginning of a String

Question

Can anyone recommend a Regex that would match on the following rules:

Upper case or a space

My strings that I want to match look like this

LONDON 10 Downing St, London

or this

NEW YORK 2859 Broadway, New York, NY 10025

I want to be able to match the words LONDON and NEW YORK when I pass in each line.

P.S. I am doing this in Java

Hmm. So did you attempt anything? – Rohit Jain Aug 04 '13 at 15:42

Tim Pietzcker · Accepted Answer · 2013-08-04T17:16:04.107

5

Beginning of the string: ^
Uppercase letter: \p{Lu}
Space: a literal space character
Combining the two: [\p{Lu} ]
Any number of the preceding token: *
Assertion that the match ends at the end of a word (requires Java 7 to work reliably): \b

Your regex, therefore, is

^[\p{Lu} ]*\b

Don't forget to double the backslashes to comply with Java's string escaping rules:

In Java 7:

Pattern regex = Pattern.compile("^[\\p{Lu} ]*\\b", Pattern.UNICODE_CHARACTER_CLASS);

In Java 6 and below:

Pattern regex = Pattern.compile("^[\\p{Lu} ]*(?<=\\p{Lu})");
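As a usage sketch (not part of the original answer), the lookbehind form could be wrapped in a small helper like the one below. The class and method names are hypothetical. One thing worth checking before relying on the `\b` form: `\b` also holds between a space and a following digit, so on the sample addresses it appears to keep the trailing space. That is why this sketch uses the lookbehind variant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the answer.
class LeadingUppercase {
    // Uppercase letters or spaces from the start of the input; the lookbehind
    // forces the match to end right after an uppercase letter.
    private static final Pattern PREFIX =
            Pattern.compile("^[\\p{Lu} ]*(?<=\\p{Lu})");

    // Returns the leading uppercase run, or "" if the line does not start with one.
    static String leadingUppercase(String line) {
        Matcher m = PREFIX.matcher(line);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(leadingUppercase("LONDON 10 Downing St, London"));
        System.out.println(leadingUppercase("NEW YORK 2859 Broadway, New York, NY 10025"));
    }
}
```

Since the pattern is anchored with `^`, `find()` can only succeed at the start of the line, so a line that begins with a lowercase letter or a digit yields the empty string.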

edited Aug 04 '13 at 17:16

answered Aug 04 '13 at 15:43

Tim Pietzcker


Does `\b` not work in Java 6 and below? I didn't know that. – arshajii Aug 04 '13 at 15:46
@arshajii: [It only matches at ASCII word boundaries](http://stackoverflow.com/q/4304928/20670). – Tim Pietzcker Aug 04 '13 at 15:48
If you're not coding for/in Java 7, use `(?<=\p{Lu})` instead of `\b`. That's a [positive lookbehind assertion](http://www.regular-expressions.info/lookaround.html) making sure that the previous character is an uppercase letter. – Tim Pietzcker Aug 04 '13 at 15:55
Tim, it may be that even in Java 7 you will need `(?u)` or `Pattern.UNICODE_CHARACTER_CLASS` to get the `\b` to work with non-ASCII. I’d check to make sure. I would have to look up the mailing list discussions from 2–3 years ago to see what the resolution was to the `\b` mess. – tchrist Aug 04 '13 at 16:50
@tchrist: [The docs](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS) seem to suggest that you do need to set that flag explicitly (or use `(?U)` with an uppercase `U`). Thanks! – Tim Pietzcker Aug 04 '13 at 17:16
@TimPietzcker Oops, right: yes, it’s an uppercase `(?U)`. – tchrist Aug 04 '13 at 17:33
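To illustrate the flag discussed in these comments, here is a minimal sketch (Java 7 or later) comparing the default `\w` with `Pattern.UNICODE_CHARACTER_CLASS` and its inline form `(?U)`. It uses `\w` rather than `\b` because the effect on `\w` is easy to observe. The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class UnicodeClassFlag {
    // Default behaviour: \w covers only [a-zA-Z_0-9].
    static final Pattern DEFAULT_WORD = Pattern.compile("\\w");
    // Flag passed to compile(): \w follows the Unicode definition of a word character.
    static final Pattern FLAG_WORD =
            Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);
    // Same flag enabled inline with an uppercase U.
    static final Pattern INLINE_WORD = Pattern.compile("(?U)\\w");

    // True if the whole string is a single word character under the given pattern.
    static boolean isWordChar(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00C9"; // LATIN CAPITAL LETTER E WITH ACUTE
        System.out.println(isWordChar(DEFAULT_WORD, eAcute));
        System.out.println(isWordChar(FLAG_WORD, eAcute));
        System.out.println(isWordChar(INLINE_WORD, eAcute));
    }
}
```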

p.s.w.g · Answer 2 · 2013-08-04T16:22:07.517

1

You can use this pattern:

^[A-Z ]+

This will match one or more upper case Latin letters or spaces from the beginning of the string.

You can easily modify this to avoid capturing trailing spaces:

^[A-Z ]*[A-Z]
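A short Java sketch of the second pattern follows. It also shows what "Latin letters" means in practice by comparing it with a `\p{Lu}` variant. That variant, the class name, and the accented sample input are assumptions added for illustration, not part of this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class; the names are illustrative only.
class AsciiVsUnicodePrefix {
    // Pattern from this answer: ASCII uppercase letters and spaces, ending on a letter.
    private static final Pattern ASCII = Pattern.compile("^[A-Z ]*[A-Z]");
    // Assumed Unicode-aware variant using the uppercase-letter category.
    private static final Pattern UNICODE = Pattern.compile("^[\\p{Lu} ]*\\p{Lu}");

    // Returns the match at the start of the line, or "" if there is none.
    private static String prefix(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group() : "";
    }

    static String asciiPrefix(String line) {
        return prefix(ASCII, line);
    }

    static String unicodePrefix(String line) {
        return prefix(UNICODE, line);
    }

    public static void main(String[] args) {
        String accented = "\u00C9VRY 1 Rue"; // made-up input starting with an accented capital
        System.out.println(asciiPrefix("NEW YORK 2859 Broadway, New York, NY 10025"));
        System.out.println(asciiPrefix(accented));
        System.out.println(unicodePrefix(accented));
    }
}
```

If the input is known to be plain ASCII, the `[A-Z]` form is the simpler choice; the `\p{Lu}` form only matters once accented or non-Latin capitals can appear.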

edited Aug 04 '13 at 16:22

answered Aug 04 '13 at 15:42

p.s.w.g


It will always end with one or more spaces, though. – Jongware Aug 04 '13 at 16:07
@Jongware I've provided a modified version of the pattern to exclude trailing spaces, if that's required. – p.s.w.g Aug 04 '13 at 16:14

score -2 · Answer 3 · answered Aug 04 '13 at 15:45

-2

Use this:

^\u+( \u+)*

It matches a number of uppercase characters, optionally followed by groups of (single space, more uppercase characters). This will avoid always ending with a space.
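Java's regex flavor has no uppercase shorthand of this kind (see the comments on this answer), so a Java sketch of the same idea has to substitute something else. The version below swaps in `\p{Lu}`. That substitution and the class name are assumptions for illustration, not this answer's original pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; the names are illustrative only.
class UppercaseWords {
    // One uppercase word, then any number of (single space + uppercase word) groups.
    // Because every group must end in a letter, the match never ends with a space.
    private static final Pattern WORDS =
            Pattern.compile("^\\p{Lu}+(?: \\p{Lu}+)*");

    // Returns the leading run of uppercase words, or "" if there is none.
    static String leadingWords(String line) {
        Matcher m = WORDS.matcher(line);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(leadingWords("LONDON 10 Downing St, London"));
        System.out.println(leadingWords("NEW YORK 2859 Broadway, New York, NY 10025"));
    }
}
```

Unlike the character-class patterns, this form stops at the first double space, since each group allows exactly one space before the next word.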

answered Aug 04 '13 at 15:45

Jongware


`\u` is not a valid regex token in any regex flavor I know, and most certainly not in Java (unless you can show me the documentation that says so). – Tim Pietzcker Aug 04 '13 at 15:47
@tim-pietzcker: how very odd. It's available in [InDesign](http://help.adobe.com/en_US/indesign/cs/using/WSFB3603CC-8D84-48d8-9F77-F3E0644CB0B6a.html#WSa285fff53dea4f8617383751001ea8cb3f-6f59a), and InDesign's GREP uses Boost. Never thought Adobe would invent such a useful shortcut! Seems you have to use the POSIX shorthand or the explicit range [A-Z] after all. – Jongware Aug 04 '13 at 16:06
It would be nice if Regex could be more intuitive. – Ankur Aug 04 '13 at 16:29
@Ankur What does “intuitive” mean? – tchrist Aug 04 '13 at 16:51
@TimPietzcker Well, yes and no and maybe. There are places where `\u` means to convert the next code point to titlecase, but that’s not technically at the regex level. It derives from the popularity of being able to say `:%s/[a-z]+/\u\L&/g` in the `vi` editor: notice it falls on the replacement side not the search side, which means that it is really a string not a pattern. Perl uses it that way, where it compiles into a call to the `ucfirst` function, which in turn maps to the Unicode titlecase version of that next code point. – tchrist Aug 04 '13 at 16:54
