Help with regex

Question

I'm constructing a regex which will accept at least 1 alphanumeric character and any number of spaces.

Right now I've got [A-Za-z0-9]+[ \t\r\n]*, which I understand to be at least 1 alphanumeric OR at least 1 space. How would I fix this?

EDIT: To answer the comments below, I want it to accept strings which contain AT LEAST 1 alphanumeric AND any number of (including no) spaces. Right now it will accept JUST whitespace.

EDIT2: To clarify, I don't want any amount of whitespace (including none) to be accepted unless there is at least 1 alphanumeric character.

No, your expression is exactly right: it will accept one or more alphanumeric characters, *followed by* some (or no) spaces. — Konrad Rudolph, Dec 02 '10 at 21:12
I read the regex you wrote (in a platform-neutral way) as: one or more alphanumeric characters *followed by* zero or more whitespace characters. That seems to match the description you asked for? — Amir Afghani, Dec 02 '10 at 21:13
You got it right. It would match at least one alphanumeric followed by zero or more whitespace characters. — detunized, Dec 02 '10 at 21:14
@Ulkmun: In answer to your edits: no, you are wrong. Right now, it will **not** accept just whitespace. It will accept exactly what you want it to. If it behaves unexpectedly, then the error is somewhere else. — Konrad Rudolph, Dec 02 '10 at 21:35
That’s one of those 1960s-style data-processing things, what with being straight ASCII. It’s kind of like overnight delivery in a nanosecond world. Java has always supported Unicode, at least in its marketing glossies, but there is a disturbingly pervasive ASCII-only mentality throughout its user community. That is something that truly perplexes me. It’s time to shed the shackles of the 1960s and step into the Brave New Millennium of Unicode. Maybe by 2250 people will catch up. — tchrist, Dec 02 '10 at 21:59
@tchrist: Do you suppose the U.S. will have fully adopted the metric system by then, too? :D — Alan Moore, Dec 02 '10 at 22:18
@Ulkmun - Can you provide a few examples of strings that should match? A few that should not match? — Freiheit, Dec 02 '10 at 22:22
@Alan: Not unless and until there is a sea change in the terms of discourse, changing it from opposing *metric* vs *English* to instead opposing *standard* vs *Imperial*. See how important that is? The former opposition tears at the heartstrings of one’s mother tongue, being tied up with our cultural self-identity. You must **never** threaten someone’s language, even let them mistakenly think you are. You get the same misplaced antipathy against Unicode, which the miseducated see as somehow supplanting “English” letters. Utter rubbish, of course, but so it is. Just something to think about. — tchrist, Dec 02 '10 at 22:24
@tchrist: I gotta learn to suppress my zinger reflex around you. ;) Excellent point, though (as usual). — Alan Moore, Dec 02 '10 at 23:52

Alan Moore · Accepted Answer · 2010-12-02T23:41:21.997

\s*\p{Alnum}[\p{Alnum}\s]*

Your regex, [A-Za-z0-9]+[ \t\r\n]*, requires the string to start with a letter or digit (or, more accurately, it doesn't start matching until it sees one). Adding \s* allows the match to start with whitespace, but you still won't match any alphanumerics after the first whitespace character that follows an alphanumeric (for example, it won't match the xyz in abc xyz). Changing the trailing \s* to [\p{Alnum}\s]* fixes that problem.
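
For anyone who wants to check this themselves, here is a minimal sketch that runs both patterns through Matcher.matches(), which only succeeds when the whole input matches. The class name WholeMatchDemo and the helper accepts are made up for this illustration, and the sample strings are my own.

```java
import java.util.regex.Pattern;

class WholeMatchDemo {
    // The pattern from the question.
    static final Pattern ORIGINAL = Pattern.compile("[A-Za-z0-9]+[ \\t\\r\\n]*");
    // The pattern suggested in this answer.
    static final Pattern FIXED = Pattern.compile("\\s*\\p{Alnum}[\\p{Alnum}\\s]*");

    // Hypothetical helper: true only if the entire input matches the pattern.
    static boolean accepts(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "abc", "abc  ", " abc", "abc xyz", "   ", "" };
        for (String s : samples) {
            System.out.printf("\"%s\": original=%b fixed=%b%n",
                    s, accepts(ORIGINAL, s), accepts(FIXED, s));
        }
    }
}
```

With whole-string matching, both patterns reject the whitespace-only and empty inputs; only the second one accepts " abc" and "abc xyz".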

On a side note, \p{Alnum} is exactly equivalent to [A-Za-z0-9] in Java, which is not the case in all regex flavors. I used \p{Alnum}, not just because it's shorter, but because it gives more protection from typos like [A-z] (which is syntactically valid, but almost certainly not what the author really meant).
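
To make the [A-z] typo concrete, here is a small sketch; the class name RangeTypoDemo and its constants are hypothetical, chosen only for this example. It relies on the fact that the ASCII characters between Z and a (including the underscore) fall inside the A-z range, and on the default behavior of \p{Alnum} without any Unicode flags.

```java
class RangeTypoDemo {
    // The typo: one range spanning both cases, which also covers [ \ ] ^ _ `
    static final String TYPO_RANGE = "[A-z]";
    // What the author almost certainly meant.
    static final String INTENDED_RANGE = "[A-Za-z]";

    public static void main(String[] args) {
        // '_' (0x5F) sits between 'Z' (0x5A) and 'a' (0x61) in ASCII.
        System.out.println("_".matches(TYPO_RANGE));      // true
        System.out.println("_".matches(INTENDED_RANGE));  // false
        // With default flags, \p{Alnum} covers only ASCII letters and digits.
        System.out.println("_".matches("\\p{Alnum}"));    // false
        System.out.println("7".matches("\\p{Alnum}"));    // true
    }
}
```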

EDIT: Performance should be considered, too. I originally included a + after the first \p{Alnum}, but I realized that wasn't a good idea. If this were part of a longer regex, and the regex didn't match right away, it could end up wasting a lot of time trying to match the same groups of characters with \p{Alnum}+ or [\p{Alnum}\s]*. The leading \s* is okay, though, because \s doesn't match any of the characters that \p{Alnum} matches.

Well, with *some* whitespace, but you know the story. — tchrist, Dec 02 '10 at 22:19

score 1 · Answer 2 · answered Dec 02 '10 at 21:13

One or more word characters, followed by zero or more whitespace characters:

\w+\s*
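
One thing worth checking before using this: in Java, \w also matches the underscore, so it is slightly wider than "alphanumeric". A minimal sketch, with a made-up class name WordCharDemo and helper accepts:

```java
class WordCharDemo {
    // Hypothetical helper: whole-string match against the pattern above.
    static boolean accepts(String input) {
        return input.matches("\\w+\\s*");
    }

    public static void main(String[] args) {
        System.out.println(accepts("abc "));  // true
        System.out.println(accepts("___"));   // true: underscore counts as a word character
        System.out.println(accepts("   "));   // false: at least one word character is required
    }
}
```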

Freiheit

It most certainly does not! `[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]+[\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u2029\u202F\u205F\u3000]*` conforms to the description you have given. – tchrist Dec 02 '10 at 22:16
@aioobe: **That**, sir, is simply the correct expression in Java’s Unicode regex language that lines up with @Freiheit’s description, as his description did not match his pattern. You can read more about this scandalous state of affairs [here in this answer](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261). – tchrist Dec 02 '10 at 22:51

cristian · Answer 3 · 2010-12-02T21:17:28.023

0

Hey, try this: ([^\s]+\s*)

[^\s] means match everything that is not whitespace, while \s* means that whitespace is optional (if you really want at least one whitespace character, put + instead of *).

Edit: Sorry, mine matches everything, not only alphanumerics (use ([a-zA-Z0-9]+\s*) for alphanumerics).

edited Dec 02 '10 at 21:17

answered Dec 02 '10 at 21:12

You can use `\S` for matching non-whitespace chars. – jjnguy Dec 02 '10 at 21:27
No, that is incorrect. In a Java regex, `[^\s]` matches numerous whitespace codepoints, including U+85 and the exceedingly common U+A0. Similarly, Java’s `\s` fails to match the 2 whitespace codepoints just mentioned as well as 17 other whitespace codepoints. Not really a workable solution in a language whose native character set is touted as modern. Its support for that native character set is shockingly meagre, even broken in places. – tchrist Dec 02 '10 at 22:13
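
The U+A0 point is easy to reproduce. The sketch below is only an illustration: the class name UnicodeSpaceDemo and its constants are made up, and the Pattern.UNICODE_CHARACTER_CLASS flag it uses for comparison assumes Java 7 or later, so it was not available when this thread was written.

```java
import java.util.regex.Pattern;

class UnicodeSpaceDemo {
    // NO-BREAK SPACE, written as an escape to avoid source-encoding issues.
    static final String NBSP = "\u00A0";
    // Default \s: only space, tab, newline, vertical tab, form feed, carriage return.
    static final Pattern DEFAULT_SPACE = Pattern.compile("\\s");
    // Assumption: Java 7+; the flag makes \s follow the Unicode White_Space property.
    static final Pattern UNICODE_SPACE =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        System.out.println(DEFAULT_SPACE.matcher(NBSP).matches()); // false
        System.out.println(NBSP.matches("[^\\s]"));                // true
        System.out.println(UNICODE_SPACE.matcher(NBSP).matches()); // true
    }
}
```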

aioobe · Answer 4 · 2010-12-02T22:12:43.597

0

This should do the trick:

\s*\p{Alnum}+\s*

\p{Alnum} is an alphanumeric character: [\p{Alpha}\p{Digit}]
* says "zero or more times"
+ says "at least one" (not "or" as you seem to believe, or is written |)
| means "or"
\s is a whitespace character: [ \t\n\x0B\f\r]
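
Putting those pieces together, a quick sketch of how the pattern behaves under String.matches (the class name TrimmedAlnumDemo and the helper accepts are hypothetical, and the sample inputs are my own):

```java
class TrimmedAlnumDemo {
    // Hypothetical helper: whole-string match against \s*\p{Alnum}+\s*
    static boolean accepts(String input) {
        return input.matches("\\s*\\p{Alnum}+\\s*");
    }

    public static void main(String[] args) {
        System.out.println(accepts(" abc "));   // true: whitespace allowed on both sides
        System.out.println(accepts("   "));     // false: + demands at least one alphanumeric
        System.out.println(accepts(""));        // false
        System.out.println(accepts("abc xyz")); // false: no whitespace allowed between alphanumerics
    }
}
```

Note the last case: this pattern only allows whitespace around a single run of alphanumerics, not in the middle.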

EDIT: To answer the comments below I want it to accept strings which contain AT LEAST 1 alphanumeric AND any number of (including no) spaces.

The pattern I suggested requires at least one alphanumeric character.

EDIT2: To clarify, I don't want any amount of whitespace (including none) to be accepted unless there is at least 1 alphanumeric character.

The pattern I suggested will not accept a string consisting only of whitespace characters.

edited Dec 02 '10 at 22:12

answered Dec 02 '10 at 21:23

No, `\p{Alnum}` is *not* “any alphanumeric character” in Java. It only looks like a Unicode property, but is in fact merely a POSIX character-class, operative in the ASCII-only C locale. A reasonably concise substitute for the 1960s-style ASCII is to use `[\pL\pN]`, although that may not catch all the things you really ought to catch, such as diacritics and connector punctuation. Also, that impoverished whitespace description is like, **so antemillennial!** There *is* a correct version that breaks you out of the 1960s, but unfortunately the margin is too small here to write the correct answer.☺ – tchrist Dec 02 '10 at 22:07
Hehe, well that's a verbatim copy from the Java API documentation, you go file a documentation bug-report ;-) Besides, it says that it matches *an* alphanumeric character, not *all* alphanumeric characters, right? ;) – aioobe Dec 02 '10 at 22:20
Notwithstanding that the Java `Pattern` API documentation varies from sketchy to misleading to downright **wrong**, such casuistry is unlikely to persuade a jury of your peers. And the bug is not merely in their documentation. It is the library itself, which really cannot be used as is for anything except literal enumerated codepoints or Unicode properties. The rest is pure bollocks. It requires serious surgery — **if** you can de-capon it. [Read ’em and weep!](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261) – tchrist Dec 02 '10 at 22:53
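
For readers who want to see the ASCII-versus-Unicode difference discussed in these comments, here is a minimal sketch. The class name UnicodeAlnumDemo and its constants are invented for the example; it compares `\p{Alnum}` with the `[\pL\pN]` substitute mentioned above, and adds `\pM` to show why diacritics (combining marks) need separate handling.

```java
class UnicodeAlnumDemo {
    // U+00E9, e with acute accent as a single precomposed letter.
    static final String PRECOMPOSED = "\u00e9";
    // Plain 'e' followed by U+0301 COMBINING ACUTE ACCENT (a mark, not a letter).
    static final String DECOMPOSED = "e\u0301";

    public static void main(String[] args) {
        // With default flags, \p{Alnum} is ASCII-only.
        System.out.println(PRECOMPOSED.matches("\\p{Alnum}"));        // false
        // Unicode letter or number properties accept the precomposed letter.
        System.out.println(PRECOMPOSED.matches("[\\pL\\pN]"));        // true
        // The combining accent is neither a letter nor a number.
        System.out.println(DECOMPOSED.matches("[\\pL\\pN]+"));        // false
        // Adding the mark property covers the decomposed form as well.
        System.out.println(DECOMPOSED.matches("[\\pL\\pM\\pN]+"));    // true
    }
}
```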
