Regex - pattern capture everything except for pattern [.net]

Question

I would like to capture anything up to, but not including, a particular pattern. My actual problem has to do with parsing information out of HTML, but I am distilling the problem down to an example to, hopefully, clarify my question.

Source

xaxbxcabcabc

Desired Match

xaxbxc

If I use a lookahead, the greedy expression runs on to the last occurrence, so the match still contains the first occurrence:

.*(?=abc) => xaxbxcabc

I would like something along the lines of a negated character class, but for a whole pattern:

.*[^abc] // where abc is treated as a pattern, rather than as a list meaning anything but a, b or c

I am using http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx for testing

[Regex is not for parsing HTML.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Matti Virkkunen Feb 23 '11 at 19:35
You might find http://regexhero.net/tester/ to be a good tester as well. – driis Feb 23 '11 at 19:37
@Mormegil's answer to use `*?` is the one you want. Just FYI, it's possible to have a negative lookahead, so your last code block would become `.*(?!abc)`. However, that doesn't seem relevant to your situation, nor is negative lookbehind. `.*(?!abc)` would capture `xaxbxcabcabc` from your sample, and `.*(?<!abc)` would capture `xaxbxcabcab`. – Justin Morgan - On strike Feb 23 '11 at 20:10
@Matti - I understand the opposition to using regex to parse HTML. My case may (or I could easily be wrong) be a bit different. In this case, I am trying to pull specific information out of a specific page where the HTML is poorly formatted and contains no semantic signals as to the meaning of the content. I am using regex to find contextual indications of the meaning of content. The result will be a brittle data capture function that I know I will have to edit any time the site owner changes the markup. In an ideal world, they would provide an API, or at least generate better HTML. – Hypnovirus Feb 23 '11 at 20:52
@driis - Thanks for the suggestion, I will check out that tester. – Hypnovirus Feb 23 '11 at 20:56
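
The results quoted in the question and in Justin Morgan's comment can be checked with a short script. This is only a sketch, written in Python rather than .NET, on the assumption that greedy quantifiers and lookarounds behave the same way in both engines for these particular patterns. The helper `first_match` is a made-up name for illustration.

```python
import re

source = "xaxbxcabcabc"

def first_match(pattern: str, text: str) -> str:
    """Return the text of the first match, or raise if there is none."""
    m = re.search(pattern, text)
    if m is None:
        raise ValueError(f"{pattern!r} did not match")
    return m.group(0)

# Greedy .* grabs everything, then backtracks only as far as the LAST "abc".
greedy_lookahead = first_match(r".*(?=abc)", source)

# At the very end of the string nothing follows, so (?!abc) succeeds at once.
negative_lookahead = first_match(r".*(?!abc)", source)

# The whole string ends in "abc", so .* has to give back one character.
negative_lookbehind = first_match(r".*(?<!abc)", source)

print(greedy_lookahead)     # xaxbxcabc
print(negative_lookahead)   # xaxbxcabcabc
print(negative_lookbehind)  # xaxbxcabcab
```

None of the three produces the desired `xaxbxc`, which is why the answers turn to a lazy quantifier instead.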

score 4 · Answer 1 · answered Feb 23 '11 at 19:38

A non-greedy (lazy) quantifier *? could be useful here, e.g.

^(?<captured>.*?)abc.*$

Edit: Just to be clear, the explicit capture is (of course) not needed; the really important part is just

(.*?)abc
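
As a rough check of both forms above, here is a minimal sketch in Python (not .NET, so treat the equivalence as an assumption). The one syntax difference that matters here is that Python writes a named group as `(?P<name>...)` where .NET writes `(?<name>...)`.

```python
import re

source = "xaxbxcabcabc"

# Lazy .*? grows one character at a time and stops at the FIRST
# position where "abc" can follow.
short_form = re.search(r"(.*?)abc", source)
captured_short = short_form.group(1)

# Same idea with anchors and a named group (Python spelling of the name).
anchored = re.search(r"^(?P<captured>.*?)abc.*$", source)
captured_named = anchored.group("captured")

print(captured_short)  # xaxbxc
print(captured_named)  # xaxbxc
```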

edited Feb 23 '11 at 19:50

answered Feb 23 '11 at 19:38

Mormegil


Thanks for the response. I wish I could select multiple accepted answers. – Hypnovirus Feb 23 '11 at 21:03

score 3 · Accepted Answer · answered Feb 23 '11 at 20:02


Anchoring the regex and using a lazy quantifier solves the problem:

"^.*?(?=abc)"
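
To see what the anchor contributes, one can collect all matches rather than just the first. The sketch below uses Python's `re.findall` as a stand-in for .NET's `Regex.Matches`; that correspondence is an assumption, and the exact list of extra matches in the unanchored case may differ between engines.

```python
import re

source = "xaxbxcabcabc"

# Anchored: a match can only start at position 0, so there is exactly one.
anchored = re.findall(r"^.*?(?=abc)", source)

# Unanchored: the first match is the same, but the engine then goes on
# to report further (possibly empty) matches later in the string.
unanchored = re.findall(r".*?(?=abc)", source)

print(anchored)         # ['xaxbxc']
print(unanchored[0])    # xaxbxc
print(len(unanchored))  # more than 1
```

If only the first match is ever taken, the lazy quantifier alone gives the same result. The anchor matters once all matches are enumerated.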

answered Feb 23 '11 at 20:02

xanatos


Thanks for the response. This is the solution I decided to go with. – Hypnovirus Feb 23 '11 at 21:02

score 2 · Answer 3 · answered Feb 23 '11 at 19:37


Why not use a replace?

string result = Regex.Replace(input, "abc.*$", "");

This removes everything from the first occurrence of abc onwards, leaving you with all of the content up to that point. If the input spans several lines, pass RegexOptions.Singleline as well so that . also matches newlines.
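
The same replace-based idea can be sketched in Python, assuming `re.sub` with `re.DOTALL` is a fair stand-in for `Regex.Replace` with `RegexOptions.Singleline`. The function name `strip_from_marker` and the second sample string are made up for illustration.

```python
import re

def strip_from_marker(text: str, marker: str = "abc") -> str:
    """Remove the first occurrence of marker and everything after it."""
    # re.escape keeps the marker literal; DOTALL lets '.' cross newlines,
    # so the removal also works on multi-line input.
    return re.sub(re.escape(marker) + r".*$", "", text, flags=re.DOTALL)

single_line = strip_from_marker("xaxbxcabcabc")
multi_line = strip_from_marker("x\nabc\nyabc")

print(single_line)       # xaxbxc
print(repr(multi_line))  # 'x\n'
```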

answered Feb 23 '11 at 19:37

Dexter


Thanks for the answer. For the example I used, this would not only work, but probably be the cleanest solution. However, in the case I am working on, it would add a step. I am using a lookbehind to initiate the pattern. So, I would have to match everything after the lookbehind and then do the replace on that match. – Hypnovirus Feb 23 '11 at 20:36
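
For the situation described in this comment, a lookbehind can open the match and a lazy lookahead can close it, so no second replace step is needed. The sketch below is in Python and makes several assumptions: the `price:` prefix and the sample text are hypothetical, and Python requires a fixed-width lookbehind, whereas .NET also accepts variable-width ones.

```python
import re

# Hypothetical input: some leading text, a known prefix, then the payload.
text = "junk price:xaxbxcabcabc"

# (?<=price:) starts the match right after the prefix without consuming it;
# .*? then expands lazily until the first "abc" is directly ahead.
pattern = re.compile(r"(?<=price:).*?(?=abc)")

m = pattern.search(text)
payload = m.group(0) if m else None

print(payload)  # xaxbxc
```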
