regex: matching phrases without a > or white space

Question

I'm parsing some HTML using regex, and I want to match lines that start with a word rather than an HTML tag, while also stripping the leading white space. Using C# regex, my first pattern was:

pattern = @"^\s*([^<])";

which attempts to consume all the leading white space and then capture a non-'<' character. Unfortunately, if the line is all white space before the first '<', \s* backtracks by one character and the group captures the last white space character before the '<'. I would like the match to fail in that case.
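A minimal sketch that reproduces the behaviour described above. The class name, helper method, and sample lines are made up for illustration and are not from the original pages.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstPatternDemo
{
    // The first pattern from the question.
    public const string Pattern = @"^\s*([^<])";

    // Returns true on a match and hands back the text of capture group 1.
    public static bool TryCapture(string line, out string captured)
    {
        Match m = Regex.Match(line, Pattern);
        captured = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        // Hypothetical sample lines.
        TryCapture("   Hello <b>world</b>", out string a);
        TryCapture("   <p>text</p>", out string b);

        Console.WriteLine("[" + a + "]"); // prints "[H]"
        Console.WriteLine("[" + b + "]"); // prints "[ ]": \s* gave back one space to the group
    }
}
```

The second line matches even though it starts with a tag, because `\s*` is allowed to give up its last space so that `[^<]` has something to consume.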

Any ideas?

Can I refer you to [my answer](http://stackoverflow.com/questions/792679/need-help-writing-regular-expression-html-parsing/792686#792686) to another similar question? — Brian Agnew, Apr 27 '09 at 10:18
HTML parsing has been discussed a lot. Refer to this post: [Using regular expressions to parse HTML: why not?](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) — Jérôme, Apr 27 '09 at 10:16

score 3 · Answer 1 · answered Apr 27 '09 at 10:16

Don't use regular expressions to parse HTML. It's a really bad idea and, at best, your code will be flaky. Whatever your language or platform, you'll have a fully functional HTML parser available. Just use that.

There is no way a regular expression can correctly handle all the cases of escaping, entity use and so on.

score 1 · Accepted Answer · answered Apr 27 '09 at 10:27

Asked the question too soon, just worked out this:

pattern = @"^\s*((?!\s)[^<]+)";
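Why this works: the negative lookahead `(?!\s)` requires the captured group to start at a non-white-space character, so `\s*` can no longer hand a trailing space back to the group. On a line that has only white space before the `<`, every backtracking position fails and there is no match. A minimal sketch, again with a made-up class name, helper, and sample lines:

```csharp
using System;
using System.Text.RegularExpressions;

public static class FixedPatternDemo
{
    // The pattern from this answer.
    public const string Pattern = @"^\s*((?!\s)[^<]+)";

    // Returns true on a match and hands back the text of capture group 1.
    public static bool TryCapture(string line, out string captured)
    {
        Match m = Regex.Match(line, Pattern);
        captured = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        // Hypothetical sample lines.
        bool okText = TryCapture("   Hello <b>world</b>", out string text);
        bool okTag = TryCapture("   <p>text</p>", out string none);

        Console.WriteLine(okText + " [" + text + "]"); // prints "True [Hello ]"
        Console.WriteLine(okTag + " [" + none + "]");  // prints "False []"
    }
}
```

Note that the group still keeps any white space between the text and the next `<` (here the space after "Hello"), so a `Trim()` on the captured value may be wanted if that matters.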

Thanks for the feedback about regex and HTML, I'll bear it in mind for the future. I'm writing a utility program to make a few pages multi-language (i.e. add asp:literals for hardcoded text, etc.). I think regex is sufficient for this purpose, but if there are better tools please let me know (web stuff isn't my area...).
