Need some C# Regular Expression Help

Question

I'm trying to come up with a regular expression that will stop at the first occurrence of </ol>. My current RegEx sort of works, but only if </ol> has spaces on either end. For instance, instead of stopping at the first instance in the line below, it stops at the second:

some random text <a href = "asdf">and HTML</a></ol></b> bla </ol>

Here's the pattern I'm currently using: string pattern = @"some random text(.|\r|\n)*</ol>";

What am I doing wrong?

_Please_, use an HTML parser. – SLaks Jul 06 '11 at 03:03

score 3 · Answer 1 · answered Jul 06 '11 at 03:02

string pattern = @"some random text(.|\r|\n)*?</ol>";

Note the question mark after the star: it makes the quantifier non-greedy (lazy), so it matches as few characters as possible, rather than the default greedy behavior of matching as many as possible.
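To see the difference on the sample line from the question, here is a minimal sketch. The class and method names are made up for illustration; only `Regex.Match` and the two patterns come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Sample line from the question.
    public const string Input =
        "some random text <a href = \"asdf\">and HTML</a></ol></b> bla </ol>";

    // Greedy quantifier: consumes as much as it can, then backtracks
    // to the LAST </ol> in the input.
    public static string Greedy() =>
        Regex.Match(Input, @"some random text(.|\r|\n)*</ol>").Value;

    // Lazy quantifier: expands one character at a time and stops
    // at the FIRST </ol>.
    public static string Lazy() =>
        Regex.Match(Input, @"some random text(.|\r|\n)*?</ol>").Value;

    public static void Main()
    {
        Console.WriteLine(Greedy()); // the whole line, up to the second </ol>
        Console.WriteLine(Lazy());   // some random text <a href = "asdf">and HTML</a></ol>
    }
}
```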

Mike Caron


score 2 · Answer 2 · answered Jul 06 '11 at 03:03

Make your wild-card "ungreedy" by adding a ?. e.g.

some random text(.|\r|\n)*?</ol>
                          ^- Addition

This will make regex match as few characters as possible, instead of matching as many (standard behavior).

Oh, and regex shouldn't parse [X]HTML

Brad Christie


Why shouldn't it parse HTML? Also, for whatever reason, now when I try the ungreedy version of it, it doesn't return anything at all..? – Sootah Jul 06 '11 at 03:16
@Sootah: because HTML is too erratic for regex to be a feasible option. There are too many variables/inconsistencies involved. Also, try adding a capture group to your expression: `(some random text(?:.|\r|\n)*?)` (note the added parentheses, and the switch from a capturing "either or" group to a non-capturing group (`(?:)`)) – Brad Christie Jul 06 '11 at 03:19
@Sootah: See also this post: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Brad Christie Jul 06 '11 at 03:20
I'm not trying to do anything even remotely complex with it, I simply want it to return from the beginning of the search string until the first occurrence of </ol>. – Sootah Jul 06 '11 at 03:33
@Brad: Hogwash. Trying to _parse_ HTML with regex is a fool's errand, but extracting text is not only okay, it's what regular expressions were designed to do. I dare say that for a purpose like this, the regular expression is _more_ flexible than using a DOM tree or something. – Mike Caron Jul 06 '11 at 14:31
@Mike Caron: I'm not going to argue the point, merely say that HTML is not ideal for regex. You get into too many exceptions, rules, conditions, etc. when you start trying to grab simple strings from HTML strings (for example, getting "Hello, World!" from `Hello, World!`). – Brad Christie Jul 06 '11 at 14:36

score 1 · Accepted Answer · answered Jul 06 '11 at 04:50

While not a Regex, why not simply use the Substring functions, like:

string returnString = someRandomText.Substring(0, someRandomText.IndexOf("</ol>"));

That would seem to be a lot easier than coming up with a Regex to cover all the possible varieties of characters, spaces, etc.
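A slightly more defensive sketch of the same idea is below. `ExtractBetween` is a hypothetical helper, not something from the thread; it assumes you know a start marker and an end marker, and it returns null when either is missing, because `IndexOf` returns -1 in that case and `Substring` would otherwise throw.

```csharp
using System;

public static class MarkerExtractor
{
    // Hypothetical helper: returns the text between startMarker and the first
    // endMarker that follows it, or null when either marker is not found.
    public static string ExtractBetween(string text, string startMarker, string endMarker)
    {
        int start = text.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return null;

        // Skip past the start marker itself.
        start += startMarker.Length;

        // Search for the end marker only after the start marker.
        int end = text.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) return null;

        // Substring takes (startIndex, length), so the length is end - start.
        return text.Substring(start, end - start);
    }

    public static void Main()
    {
        string html = "header some random text <a href = \"asdf\">and HTML</a></ol></b> bla </ol>";
        Console.WriteLine(ExtractBetween(html, "some random text", "</ol>"));
    }
}
```

Passing the start index to the second `IndexOf` call keeps an earlier `</ol>` elsewhere in the page from being picked up by mistake.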

Tim


The main issue is that I have positively no idea where in the string the chunk I'm going to need is, if it were fixed each time (as I'm grabbing some stuff off the web) then I'd just use a starting and ending index location or something similar. The only bits I have are that I know exactly what few unique words will be in front of the section I need each time, and that the best way to find the "end" of what I need is via the /ol tag that comes after the bits I want. I can then clean it up from there. – Sootah Jul 06 '11 at 05:44
Although, now that I think about it, I could probably grab the index location of the start of the unique identifier, use that as the starting point, and then use the function that you provided... I'll have to give that a shot, although if there is a way to search with a starting phrase, then grab anything and everything in between it and an end "phrase" with RegEx, that'd be ideal. – Sootah Jul 06 '11 at 05:47
So, for example, if "some random text" is the start, you can set the start index to the position just past that string and then extract up to the location of the </ol>, e.g., int start = str.IndexOf("some random text") + "some random text".Length; str.Substring(start, str.IndexOf("</ol>", start) - start); It's a little more involved, but not as involved as a Regex. – Tim Jul 06 '11 at 05:52
I did end up using this method, and then a simple RegEx - <(.|\n)*?> - to strip HTML tags out of an array that I created. Works beautifully. – Sootah Jul 27 '11 at 10:53

score 0 · Answer 4 · answered Jul 06 '11 at 05:17

This regex matches everything from the beginning of the string up to the first </ol>. It uses Friedl's "unrolling-the-loop" technique, so is quite efficient:

Regex pattern = new Regex(
    @"^[^<]*(?:(?!</ol\b)<[^<]*)*(?=</ol\b)",
    RegexOptions.IgnoreCase);
string resultString = pattern.Match(text).Value;
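For readers unfamiliar with the unrolled form, here is the same pattern split into its three pieces with comments. This is only an illustrative sketch; the class and method names are assumptions, and the sample inputs are the line from the question plus a made-up multi-line string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnrolledLoopDemo
{
    private static readonly Regex UpToFirstOl = new Regex(
        @"^[^<]*" +                 // leading run of characters that are not '<'
        @"(?:(?!</ol\b)<[^<]*)*" +  // a '<' that does not begin "</ol", then the next run of non-'<'
        @"(?=</ol\b)",              // succeed only if "</ol" follows; it is not consumed
        RegexOptions.IgnoreCase);

    // Returns everything before the first </ol>, or "" when there is none.
    public static string TextBeforeFirstOl(string text) =>
        UpToFirstOl.Match(text).Value;

    public static void Main()
    {
        string line = "some random text <a href = \"asdf\">and HTML</a></ol></b> bla </ol>";
        Console.WriteLine(TextBeforeFirstOl(line)); // some random text <a href = "asdf">and HTML</a>
    }
}
```

Because `[^<]` is a negated character class, it matches newlines as well, so this pattern needs no Singleline option. Note that the match excludes the `</ol>` itself, and that the `^` anchor means it always starts from the beginning of the input.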

score 0 · Answer 5 · answered Jul 06 '11 at 05:57

Others have already explained the missing ? that makes the quantifier non-greedy. I want to suggest another change as well.

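I don't like your (.|\r|\n) part. If you have only single characters in your alternation, it's simpler to use a character class, which is also easier to read (and possibly more efficient, though I haven't checked what the compiler does). Be careful, though: inside a character class the dot is a literal dot, so [.\r\n] would not do the same thing; a class such as [\s\S] is what matches any character, including newlines.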

BUT in your special case when the alternatives to the . are only newline characters, this is also not the correct way. Here you should do this:

Regex A = new Regex(@"some random text.*?</ol>", RegexOptions.Singleline);

Use the Singleline option. It makes the . match newline characters as well.
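A small sketch of the effect, using a made-up multi-line input (the class and method names are illustrative assumptions): without `RegexOptions.Singleline` the dot stops at `\n` and the pattern fails to match across lines, while with it the lazy match runs to the first `</ol>`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical input where the wanted chunk spans several lines.
    public const string Input =
        "some random text\r\n<li>item</li>\r\n</ol> more </ol>";

    // Default options: '.' matches every character except '\n'.
    public static bool MatchesWithoutSingleline() =>
        Regex.IsMatch(Input, @"some random text.*?</ol>");

    // Singleline: '.' matches '\n' too, so the match can cross line breaks.
    public static string MatchWithSingleline() =>
        Regex.Match(Input, @"some random text.*?</ol>", RegexOptions.Singleline).Value;

    public static void Main()
    {
        Console.WriteLine(MatchesWithoutSingleline()); // False
        Console.WriteLine(MatchWithSingleline());      // text through the first </ol>
    }
}
```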
