2

I would like to capture anything up to, but not including, a particular pattern. My actual problem has to do with parsing information out of HTML, but I am distilling it down to an example to, hopefully, clarify my question.

Source

xaxbxcabcabc

Desired Match

xaxbxc

If I use a lookahead, the match still includes the first occurrence of the pattern, because the greedy .* runs up to the last occurrence:

.*(?=abc) => xaxbxcabc

I would like something along the lines of a negated character class, but for a negated pattern.

.*[^abc] // where abc is treated as a pattern, rather than as a list meaning anything but a, b, or c

I am using http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx for testing

Hypnovirus
  • 2
    [Regex is not for parsing HTML.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Matti Virkkunen Feb 23 '11 at 19:35
  • 1
    You might find http://regexhero.net/tester/ to be a good tester as well. – driis Feb 23 '11 at 19:37
  • @Mormegil's answer to use `*?` is the one you want. Just FYI, it's possible to have a negative lookahead, so your last code block would become `.*(?!abc)`. However, that doesn't seem relevant to your situation, nor is negative lookbehind. `.*(?!abc)` would capture `xaxbxcabcabc` from your sample, and `.*(?<!abc)` would capture `xaxbxcabcab`. – Justin Morgan - On strike Feb 23 '11 at 20:10
  • @Matti - I understand the opposition to using regex to parse html. My case may (or I could easily be wrong) be a bit different. In this case, I am trying to pull specific information out of a specific page where the html is poorly formatted and contains no semantic signals to the meaning of the content. I am using regex to find contextual indications of the meaning of content. The result will be a brittle data capture function that I know I will have to edit anytime the site owner changes markup. In an ideal world, they would provide an api, or at least generate better html. – Hypnovirus Feb 23 '11 at 20:52
  • @driis - Thanks for the suggestion, I will check out that tester. – Hypnovirus Feb 23 '11 at 20:56
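
To make the difference between these patterns concrete, here is a minimal C# sketch that runs the greedy lookahead from the question and the two negative lookarounds from the comment above against the sample input. The class and method names (`LookaroundDemo`, `FirstMatch`) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: returns the text of the first match of `pattern` in `input`.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        const string source = "xaxbxcabcabc";

        // Greedy .* consumes everything, then backtracks only as far as the LAST "abc".
        Console.WriteLine(FirstMatch(source, @".*(?=abc)"));   // xaxbxcabc

        // The negative lookahead already succeeds at the end of the input,
        // so the greedy .* gives nothing back.
        Console.WriteLine(FirstMatch(source, @".*(?!abc)"));   // xaxbxcabcabc

        // The negative lookbehind fails at the very end (preceded by "abc"),
        // so .* backtracks by exactly one character.
        Console.WriteLine(FirstMatch(source, @".*(?<!abc)"));  // xaxbxcabcab
    }
}
```

None of the three stops at the first "abc", which is why a lazy quantifier is the usual fix here.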

3 Answers

4

A non-greedy (lazy) quantifier *? could be useful here, e.g.

^(?<captured>.*?)abc.*$

Edit: Just to be clear, the named capture group is (of course) not needed; the really important part is just

(.*?)abc
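
As a rough illustration of how the named-group version might be used from C#, the sketch below wraps it in a small helper. `LazyCaptureDemo` and `TryCaptureBefore` are hypothetical names, and the sketch assumes single-line input: `.` does not match a newline unless `RegexOptions.Singleline` is passed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyCaptureDemo
{
    // Hypothetical helper: captures everything before the first "abc".
    // Returns false (and an empty string) when "abc" does not occur.
    public static bool TryCaptureBefore(string input, out string captured)
    {
        // .*? is lazy: it grows one character at a time until "abc" can match.
        Match m = Regex.Match(input, @"^(?<captured>.*?)abc.*$");
        captured = m.Success ? m.Groups["captured"].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        if (TryCaptureBefore("xaxbxcabcabc", out string prefix))
        {
            Console.WriteLine(prefix); // xaxbxc
        }
    }
}
```

Returning a bool keeps "no match" distinguishable from "matched an empty prefix", which matters for input that starts with "abc".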
Mormegil
3

If you anchor the regex and use a lazy quantifier, you'll solve the problem:

"^.*?(?=abc)"
xanatos
2

Why not use a replace:

string result = Regex.Replace(input, "abc.*$", "");

This will remove everything from the first matching phrase onwards, leaving you with all of the content up until that point.
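
One caveat if the input spans several lines, as HTML usually does: by default `.` does not match a newline, so the pattern above would only trim the last line containing "abc". A minimal sketch of a variant that handles this, assuming `RegexOptions.Singleline` is acceptable for the input (the `TruncateDemo` and `TruncateAt` names are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class TruncateDemo
{
    // Hypothetical helper: removes everything from the first "abc" to the end of the input.
    // Singleline lets '.' match '\n', so the removal can span multiple lines.
    public static string TruncateAt(string input)
    {
        return Regex.Replace(input, "abc.*$", "", RegexOptions.Singleline);
    }

    public static void Main()
    {
        Console.WriteLine(TruncateAt("xaxbxcabcabc"));          // xaxbxc
        Console.WriteLine(TruncateAt("xaxbxc\nabc\nabc").Length); // 7 ("xaxbxc" plus the newline)
    }
}
```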

Dexter
  • Thanks for the answer. For the example I used, this would not only work, but probably be the cleanest solution. However, in the case I am working on, it would add a step. I am using a lookbehind to initiate the pattern. So, I would have to match everything after the lookbehind and then do the replace on that match. – Hypnovirus Feb 23 '11 at 20:36
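
For the lookbehind case described in that last comment, the lazy quantifier can be combined with both lookarounds so that no second replace step is needed. This is only a sketch: the marker `START` and the sample input are hypothetical stand-ins for whatever the real lookbehind matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundedCaptureDemo
{
    // Hypothetical helper: returns the text after the marker "START" and before
    // the first following "abc". Returns an empty string when nothing matches.
    public static string Between(string input)
    {
        // (?<=START) anchors the start of the match without consuming the marker;
        // .*? then expands lazily until the lookahead (?=abc) first succeeds.
        return Regex.Match(input, @"(?<=START).*?(?=abc)").Value;
    }

    public static void Main()
    {
        Console.WriteLine(Between("zzzSTARTxaxbxcabcabc")); // xaxbxc
    }
}
```

Both lookarounds are zero-width, so the match value contains only the text between the two boundaries.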