11

I need to match a string that begins with a number, followed by a dot, then one space, then one or more upper-case characters. The match must occur at the beginning of the string. I have the following string:

1. PTYU fmmflksfkslfsm

The regular expression that I tried with is:

^\d+[.]\s{1}[A-Z]+

And it does not match. What would a working regular expression be for this problem?

Peter Mortensen
user152508
  • [Matches for me](http://regexpal.com/?flags=&regex=^\d%2B[.]\s{1}[A-Z]%2B&input=1.%20PTYU%20fmmflksfkslfsm) but could be rewritten to `^\d+\.\s[A-Z]+` – Felix Kling Dec 16 '10 at 18:00
  • 2
    `{1}` is redundant: it only clutters the expression and can (should) be removed in favor of clarity. – Bart Kiers Dec 16 '10 at 18:03
  • 1
    Read about Java and regex: http://www.regular-expressions.info/java.html. @AlexR and @codaddict are both right. You need to use `\\ ` in Java to create one `\ `. – Felix Kling Dec 16 '10 at 18:07

3 Answers

27

(Sorry for my earlier error. Brain now firmly engaged. Er, probably.)

This works:

String rex = "^\\d+\\.\\s\\p{Lu}+.*";

System.out.println("1. PTYU fmmflksfkslfsm".matches(rex));
// true

System.out.println(". PTYU fmmflksfkslfsm".matches(rex));
// false, missing leading digit

System.out.println("1.PTYU fmmflksfkslfsm".matches(rex));
// false, missing space after .

System.out.println("1. xPTYU fmmflksfkslfsm".matches(rex));
// false, lower case letter before the upper case letters

Breaking it down:

  • ^ = Start of string
  • \d+ = One or more digits (the \ is escaped because it's in a string, hence \\)
  • \. = A literal . (or your original [.] is fine) (again, escaped in the string)
  • \s = One whitespace char (no need for the {1} after it) (I'll stop mentioning the escapes now)
  • \p{Lu}+ = One or more upper case letters (using the proper Unicode escape — thank you, tchrist, for pointing this out in your comment below. In English terms, the equivalent would be [A-Z]+)
  • .* = Anything else
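To illustrate the difference between \p{Lu} and [A-Z] noted in the breakdown above, here is a minimal sketch. The class name, the helper methods, and the accented sample string are made up for this illustration and do not come from the question.

```java
import java.util.regex.Pattern;

public class UpperCaseDemo {
    // Same expression as above, once with the ASCII-only class and once with the Unicode property.
    static final Pattern ASCII_ONLY = Pattern.compile("^\\d+\\.\\s[A-Z]+.*");
    static final Pattern UNICODE_UPPER = Pattern.compile("^\\d+\\.\\s\\p{Lu}+.*");

    static boolean asciiMatches(String s) {
        return ASCII_ONLY.matcher(s).matches();
    }

    static boolean unicodeMatches(String s) {
        return UNICODE_UPPER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The first letter is U+00C9 (E with acute accent), which lies outside A-Z.
        String accented = "1. \u00C9COLE de musique";

        System.out.println(asciiMatches(accented));
        // false, [A-Z] does not cover U+00C9

        System.out.println(unicodeMatches(accented));
        // true, \p{Lu} covers any upper case letter

        System.out.println(asciiMatches("1. PTYU fmmflksfkslfsm"));
        // true, plain ASCII input works with either form
    }
}
```

If the input is known to be plain ASCII, the two forms should behave the same for this pattern.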

See the documentation for java.util.regex.Pattern for details.

You only need the .* at the end if you're using a method like String#matches (above) that will try to match the entire string.

T.J. Crowder
  • 1
    It’s hard to tell if the OP’s stuck using 7-bit ASCII data, or whether he needs it to work on any Java characters — which are Unicode, not ASCII. If the latter, you of course need to make adjustments. `\p{Lu}` is probably good enough for uppercase letters, but Java offers no convenient way to talk about Unicode whitespace, so you have to write `[\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]`, as [I have elsewhere written](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261). – tchrist Dec 16 '10 at 18:15
  • 1
    One really shouldn’t say that `[A-Z]+` matches “one or more upper case letters”, because that’s what `\p{Lu}+` does. `[A-Z]+` merely matches one or more (and preferring more) of A to Z — which I hold to be slightly but significantly different. Similarly, `\s` isn’t a whitespace char, but rather one of `[ \t\n\x0B\f\r]` only. Am I just being too finicky here? I work on immense corpora of gigabytes of Unicode characters — but *never* ASCII — daily using both Java and Perl, so perhaps I need to be more careful than others. Or maybe not? – tchrist Dec 16 '10 at 18:22
  • 1
    @tchrist: **very, very good points** I can't believe I did something so English-centric. I've ticked other people off for it. Much appreciate your ticking me off for it!! – T.J. Crowder Dec 16 '10 at 19:28
  • And fixed (I was rushing out the door before, wanted to double-check first). – T.J. Crowder Dec 16 '10 at 22:24
1

It depends on which method you are using. I think it will work if you use Matcher.find(). It will not work with Matcher.matches(), because matches() has to match the entire input, not just the beginning. If you are using matches(), change your pattern as follows:

^\d+\.\s{1}[A-Z]+.*

(note the trailing .*)

And I'd also use \. instead of [.]. It is more readable.
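As a rough sketch of the difference between the two methods, the following runs the asker's original pattern (without the trailing .*) through matches(), find(), and lookingAt(). The class and helper names are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVsMatches {
    // The pattern without a trailing .*, so it only describes the prefix.
    static final Pattern PREFIX = Pattern.compile("^\\d+\\.\\s[A-Z]+");

    // matches() succeeds only if the pattern covers the entire input.
    static boolean wholeInput(String s) {
        return PREFIX.matcher(s).matches();
    }

    // find() succeeds if the pattern matches somewhere; the ^ pins it to the start.
    static boolean found(String s) {
        return PREFIX.matcher(s).find();
    }

    // lookingAt() always starts at the beginning but need not consume everything.
    static boolean atStart(String s) {
        return PREFIX.matcher(s).lookingAt();
    }

    // Returns the matched prefix, or null if there is none.
    static String matchedPrefix(String s) {
        Matcher m = PREFIX.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "1. PTYU fmmflksfkslfsm";

        System.out.println(wholeInput(input));    // false, text remains after PTYU
        System.out.println(found(input));         // true
        System.out.println(atStart(input));       // true
        System.out.println(matchedPrefix(input)); // 1. PTYU
    }
}
```

Which one fits depends on whether the rest of the string matters: find() or lookingAt() leave the pattern short, while matches() needs the .* to absorb the remainder.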

Alan Moore
AlexR
0

"^[0-9]+\. [A-Z]+ .+"

khachik