Regular Expression to match ([^>(),]+) but include some \w's in it?

Question

I'm using PHP's preg_replace function, and I have the following regex:

(?:[^>(),]+)

to match any characters except >(),. The problem is that I want to make sure the match contains at least one letter (\w) and is not empty. How can I do that?

Is there a way to say what I DO WANT to match in the [^>(),]+ part?

You should probably express functionally what you want to do. If you want to look for non-empty words, you can simply use `\w+` — Antoine Pelisse, Nov 29 '10 at 12:16
Could you be more specific? Do you want them included in that exact sequence? — EarlyPoster, Nov 29 '10 at 12:23
I'll be more specific. I have the following expression: $exp = " div.class#id > table( tr > td > lable, tr > td > input value=\$val )"; and I want to be able to match these: (div.class#id) (table) (tr, etc...) (input value=$val) — Fábio de Lima Souto, Nov 29 '10 at 12:40

Tim Pietzcker · Accepted Answer · 2010-11-29T13:17:14.827

1

You can add a lookahead assertion:

(?:(?=.*\p{L})[^>(),]+)

This makes sure that there will be at least one letter (\p{L}; \w also matches digits and underscores) somewhere in the string.

You don't really need the (?:...) non-capturing parentheses, though:

(?=.*\p{L})[^>(),]+

works just as well. Also, to ensure that we always match the entire string, it might be a good idea to surround the regex with anchors:

^(?=.*\p{L})[^>(),]+$
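
As a minimal sketch of how the anchored pattern behaves in PHP: the helper name isValidToken and the sample strings below are made up for illustration, and the u modifier assumes the input is UTF-8.

```php
<?php
// Hypothetical helper (not part of the original answer): returns true when
// the whole string contains none of > ( ) , and has at least one letter.
function isValidToken(string $s): bool
{
    // ^...$ forces the character class to span the entire string, so the
    // letter found by the lookahead must lie inside the match.
    return preg_match('/^(?=.*\p{L})[^>(),]+$/u', $s) === 1;
}

var_dump(isValidToken('input value=$val')); // bool(true)
var_dump(isValidToken('123'));              // bool(false): no letter
var_dump(isValidToken('123(p'));            // bool(false): contains "("
```

Without the anchors, the third string would still yield a match ("123"), because the lookahead only checks that a letter exists somewhere to the right of the starting position.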

EDIT:

For the added requirement of not including surrounding whitespace in the match, things get a little more complicated. Try

^(?=.*\p{L})(\s*)((?:(?!\s*$)[^>(),])+)(\s*)$

In PHP, for example, replacing each such string with REPLACEMENT while leaving leading and trailing whitespace alone could look like this:

$result = preg_replace(
    '/^          # Start of string
    (?=.*\p{L})  # Assert that there is at least one letter
    (\s*)        # Match and capture optional leading whitespace  (--> \1)
    (            # Match and capture...                           (--> \2)
     (?:         # ...at least one character of the following:
      (?!\s*$)   # (unless it is part of trailing whitespace)
      [^>(),]    # any character except >(),
     )+          # End of repeating group
    )            # End of capturing group
    (\s*)        # Match and capture optional trailing whitespace (--> \3)
    $            # End of string
    /xu', 
    '\1REPLACEMENT\3', $subject);
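
A quick way to try this on a few inputs is sketched below. The wrapper name replaceToken and the sample strings are hypothetical; the pattern is the compact one-line form given earlier in this answer.

```php
<?php
// Hypothetical wrapper around the whitespace-preserving replacement.
function replaceToken(string $subject): string
{
    $result = preg_replace(
        '/^(?=.*\p{L})(\s*)((?:(?!\s*$)[^>(),])+)(\s*)$/u',
        '\1REPLACEMENT\3',   // keep leading (\1) and trailing (\3) whitespace
        $subject
    );
    // preg_replace returns null on error; fall back to the input in that case.
    return $result ?? $subject;
}

echo '[' . replaceToken('  table ') . "]\n";            // [  REPLACEMENT ]
echo '[' . replaceToken(' input value=$val ') . "]\n";  // [ REPLACEMENT ]
echo '[' . replaceToken(' 123 ') . "]\n";               // [ 123 ] (no letter, unchanged)
```

Whitespace in the middle of the token is consumed by the second group, because the negative lookahead only rejects whitespace that runs all the way to the end of the string.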

edited Nov 29 '10 at 13:17

answered Nov 29 '10 at 12:09

Tim Pietzcker


In *Perl* at least, this doesn't work quite as specified by the OQ. For example, the string "123(p" passes the lookahead assertion (due to matching .*p) but fails the requirement that the captured group include the p (it doesn't). I may have misunderstood the PHP requirements. – Alex Brown Nov 29 '10 at 12:48
I would also like to know if there is a way to stop the engine from matching whitespace (\s) at the end. I mean, it can be in the middle but not at the end or beginning? – Fábio de Lima Souto Nov 29 '10 at 12:55
@Alex Brown: You're right; it's probably a good idea to use some anchors here. Will edit my answer. – Tim Pietzcker Nov 29 '10 at 12:55
@Fábio: Do you mean that you want the match to fail if there is whitespace at the front/end, or do you want the match to succeed but without adding the surrounding whitespace to the match? – Tim Pietzcker Nov 29 '10 at 12:58
I want the match to succeed but without including the whitespace, exactly. I've already tried (?!\s+), as I've seen explained on the page you linked, but with no success – Fábio de Lima Souto Nov 29 '10 at 13:04
@tchrist: I know. My statement is still correct (although not complete). And since I recommended against `\w` anyway, I thought that this level of detail would have been too much at this point. – Tim Pietzcker Nov 29 '10 at 13:39
@Tim, `\w` is a lot broader than “also digits and underscores.” In regex dialects that support set union and intersection operations on bracketed character classes, a `\w` is equivalent to `[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]`. That’s how you have to write in Java, which has no `\w` support for Unicode. That makes `\b` extremely awkward to write. – tchrist Nov 29 '10 at 13:39
@Tim, Agreed. You need the detail only in languages without Unicode support for the charclass shortcuts. A Unicode `\b` in Java must be written `(?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))`, which is horrid. – tchrist Nov 29 '10 at 13:40
`\b` is just `(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))`, but those `\w`’s need expansion into an unsightliness that makes the Unicode whitespace pattern in Java downright friendly: `[\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u2029\u202F\u205F\u3000]`, eh? :) – tchrist Nov 29 '10 at 13:42
Lastly, `\B` is really `(?:(?<=\w)(?=\w)|(?<!\w)(?!\w))`. I often find `\b` not quite what I want, preferring instead `(?:(?<=^)|(?<=\s))` for a left edge and `(?=$|\s)` for a right edge. Those aren’t quite so weird as `\b`, but you do have to specify left vs right; in practice, this is no different than having to distinguish `^` vs `$`. – tchrist Nov 29 '10 at 13:47
@tchrist: This is fascinating. I will ask a question about this and would be grateful if you could provide a similar exhaustive answer there. This makes reading (and bookmarking) this important piece of knowledge easier. – Tim Pietzcker Nov 29 '10 at 14:48
@Tim: Sure thing. Most of my knowledge you can glean from reading [this answer’s source code](http://stackoverflow.com/questions/3537706/howto-unescape-a-java-string-literal-in-java/4298836#4298836), which I posted yesterday answering an old question (and which nobody but me has yet looked at, darn it). But if you ask a question with the `regex` tag, I’ll find it when I get to work in an hour and answer with a much more distilled summary of all the exact mappings for `\w \W \s \S \v \V \h \H \d \D \b \B \X \R`. – tchrist Nov 29 '10 at 14:56

score 0 · Answer 2 · answered Nov 29 '10 at 12:32

You can just "insert" \w inside: (?:[^>(),]+\w[^>(),]+). That way the match contains at least one word character and obviously cannot be empty. BTW, \w matches digits and underscores as well as letters. If you want only letters, you can use the Unicode letter character class \p{L} instead of \w.

answered Nov 29 '10 at 12:32

alpha-mouse


Hmm, that's so simple, how didn't I figure it out! Yes, I just wanted to make sure that there was at least one letter in the match – Fábio de Lima Souto Nov 29 '10 at 12:52
This is slightly stricter than the OQ, since it permits .A. but not .A, A. or A – Alex Brown Nov 29 '10 at 12:55
@Alex Brown: yes, you are right, stars should be used instead of pluses. – alpha-mouse Nov 29 '10 at 12:57
Actually, [according to Unicode](http://unicode.org/reports/tr18/#Compatibility_Properties), `\w` comprises `\pL` all *Letters,* `\pM` all *Marks,* `\p{Nd}` the *Decimal Numbers,* the `\p{Nl}` the *Letter Numbers,* `\p{Pc}` the *Connector Punctuation,* plus all code points which are both `\p{InEnclosedAlphanumerics}` and also `\p{So}`, the *Other Symbols.* – tchrist Nov 29 '10 at 13:32

score 0 · Answer 3 · answered Nov 29 '10 at 12:53

How about this:

(?:[^>(),]*\w[^>(),]*)
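
A rough sketch of this pattern applied to the sample expression from the question's comments, using preg_match_all. The helper name extractTokens is hypothetical, and trimming the matches afterwards is an assumption about the desired output, since the character class also matches surrounding spaces.

```php
<?php
// Hypothetical helper: collect every run that contains a word character,
// then trim the whitespace that the character class also matches.
function extractTokens(string $exp): array
{
    preg_match_all('/(?:[^>(),]*\w[^>(),]*)/', $exp, $matches);
    return array_map('trim', $matches[0]);
}

// Single quotes keep $val literal, like \$val in the double-quoted original.
$exp = ' div.class#id > table( tr > td > lable, tr > td > input value=$val )';
print_r(extractTokens($exp));
```

Swapping \w for \p{L} (with the u modifier) restricts the required character to letters.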

answered Nov 29 '10 at 12:53

Alex Brown


I slightly modified yours so this would work for me: (?:[^>(),]*\p{L}[^>(),]*) – Fábio de Lima Souto Nov 29 '10 at 13:14
