1

I'm parsing some HTML using regex, and I want to match lines that start with a word rather than an HTML tag, while also stripping the leading whitespace. Using C# regex, my first pattern was:

pattern = @"^\s*([^<])";

This attempts to consume all the leading whitespace and then capture a non-'<' character. Unfortunately, if the line is all whitespace before the first '<', it returns the last whitespace character before the '<'. I would like the match to fail in that case.
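To make the failure concrete, here is a minimal sketch. The class name `FirstAttempt`, the helper `Capture`, and the sample lines are made up for illustration. What seems to happen is backtracking: `\s*` first consumes all the whitespace, `[^<]` then fails on the '<', so the engine gives back one whitespace character and lets `[^<]` match that instead.

```csharp
using System;
using System.Text.RegularExpressions;

static class FirstAttempt
{
    // The pattern from the question: skip whitespace, capture one non-'<' character.
    public const string Pattern = @"^\s*([^<])";

    // Hypothetical helper: returns the captured text, or null when the line does not match.
    public static string Capture(string line)
    {
        Match m = Regex.Match(line, Pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Line starting with text: the capture is the first letter.
        Console.WriteLine("[" + Capture("   Hello <b>world</b>") + "]"); // prints [H]

        // Line starting with a tag: \s* backtracks and the last space is captured.
        Console.WriteLine("[" + Capture("   <p>Hello</p>") + "]");       // prints [ ]
    }
}
```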

Any ideas?

Jérôme
Patrick
  • Can I refer you to [my answer](http://stackoverflow.com/questions/792679/need-help-writing-regular-expression-html-parsing/792686#792686) to another similar question? – Brian Agnew Apr 27 '09 at 10:18
  • HTML parsing with regex has been discussed a lot. Refer to this post: [Using regular expressions to parse HTML: why not?](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) – Jérôme Apr 27 '09 at 10:16

2 Answers

3

Don't use regular expressions to parse HTML. It's a really bad idea and, at best, your code will be flaky. Whatever your language or platform, a fully functional HTML parser will be available. Just use that.

There is no way a regular expression can correctly handle all the cases of escaping, entity use and so on.
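As one illustration of the kind of case a line-based pattern gets wrong, consider a tag wrapped across two lines. This is only a sketch: the class `FalsePositive`, the helper `LooksLikeText`, and the sample markup are hypothetical, and the pattern is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class FalsePositive
{
    // The pattern from the question, used here as a "line starts with text" test.
    public const string Pattern = @"^\s*([^<])";

    public static bool LooksLikeText(string line)
    {
        return Regex.IsMatch(line, Pattern);
    }

    static void Main()
    {
        // Hypothetical markup: one <img> tag split over two lines.
        string[] lines =
        {
            "<img src=\"logo.png\"",
            "     alt=\"Company logo\">"
        };

        foreach (string line in lines)
        {
            // The second line is still inside the tag, yet it is reported as text.
            Console.WriteLine(LooksLikeText(line) + ": " + line);
        }
    }
}
```

The second line is the continuation of the tag's attributes, but because it does not begin with '<' the pattern treats it as text. A parser that tracks where tags open and close would not make that mistake.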

cletus
1

Asked the question too soon; I just worked this out:

pattern = @"^\s*((?!\s)[^<]+)";
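A short sketch of why this seems to work, with a made-up class name (`LeadingText`), helper (`Capture`), and sample lines: the negative lookahead `(?!\s)` refuses to let the capture start on a whitespace character, so when `[^<]+` fails on a '<' the engine can no longer rescue the match by handing back a space from `\s*`.

```csharp
using System;
using System.Text.RegularExpressions;

static class LeadingText
{
    // Skip whitespace, then capture text that does not start with whitespace or '<'.
    public const string Pattern = @"^\s*((?!\s)[^<]+)";

    // Hypothetical helper: returns the captured text, or null when the line does not match.
    public static string Capture(string line)
    {
        Match m = Regex.Match(line, Pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Text before the first tag is captured, up to but excluding the '<'.
        Console.WriteLine("[" + Capture("   Hello <b>world</b>") + "]"); // prints [Hello ]

        // Only whitespace before the first tag: no match at all.
        Console.WriteLine(Capture("   <p>Hello</p>") == null);           // prints True
    }
}
```

Note that the capture still includes any whitespace sitting directly before the '<'; calling `TrimEnd()` on the result is one way to drop it.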

Thanks for the feedback about regex and HTML; I'll bear it in mind for the future. I'm writing a utility program to make a few pages multi-language (e.g. adding asp:literals for hardcoded text). I think regex is sufficient for this purpose, but if there are better tools please let me know (web stuff isn't my area...).

Patrick