
I'm using the following regex pattern to check whether a string contains HTML:

string input = "<a href=\"www.google.com\">test</a>";
const string pattern = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
Regex reg = new Regex(pattern);
var matches = reg.Matches(input);

It works fine, but if the text contains < or > characters the pattern also produces a match, which it shouldn't. For example, the following is not considered an HTML tag in our system:

string input = "<test>";
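To make the problem concrete, here is a minimal reproduction sketch. The pattern is copied from above; the class name `TagPatternRepro` and the helper `CountTags` are hypothetical names used only for this illustration. Both inputs produce at least one match, which is the unwanted behaviour for the second one.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
static class TagPatternRepro
{
    // Pattern copied unchanged from the question.
    const string Pattern = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";

    static readonly Regex Reg = new Regex(Pattern);

    // Returns how many tag-like substrings the pattern finds in the input.
    public static int CountTags(string input)
    {
        return Reg.Matches(input).Count;
    }

    static void Main()
    {
        string[] inputs = { "<a href=\"www.google.com\">test</a>", "<test>" };
        foreach (string input in inputs)
        {
            // Both inputs report at least one match.
            Console.WriteLine("{0} -> {1} match(es)", input, CountTags(input));
        }
    }
}
```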

How can I add an AND condition to this pattern that also requires </ and />?

Thanks

Sam Salim
  • Why don't you just use the `string.Contains()` method that C# provides? Why make things harder by trying to figure out a regex when a single-line check would give you the result? Just curious. – MethodMan Oct 23 '14 at 15:09
  • I believe http://stackoverflow.com/a/1732454/603384 is relevant here. – Evan M Oct 23 '14 at 15:27

1 Answer


I would not use regex to parse or validate HTML. You could use HtmlAgilityPack:

string input = "<a href=\"www.google.com\">test</a>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(input);
bool isValidHtml = doc.ParseErrors.Count() == 0;  // true

If you want to allow only specific tags you could create a white-list of allowed tags:

var whiteList = new List<string> { "a", "b", "img", "#text" }; //fill more whitelist tags
bool isValidHtmlAndTags = doc.ParseErrors.Count() == 0 && doc.DocumentNode.Descendants()
    .All(node => whiteList.Contains(node.Name));
Tim Schmelter
  • It is a very big project and I cannot add a library or component on my own, so I need to do it with regex. – Sam Salim Oct 24 '14 at 06:34
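
If adding a library is not an option, one regex-only possibility is to keep the pattern from the question and add a second condition on its matches. The sketch below assumes that "HTML" means the input contains at least one closing tag (a match starting with `</`) or one self-closing tag (a match ending with `/>`); that reading of the requirement is an assumption, and `HtmlCheck` and `LooksLikeHtml` are hypothetical names. It is a heuristic, not a validator: a lone `</test>` would still count as HTML.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
static class HtmlCheck
{
    // Pattern copied unchanged from the question.
    const string TagPattern = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";

    static readonly Regex TagRegex = new Regex(TagPattern);

    // Assumption: the input counts as HTML only if at least one matched tag
    // is a closing tag ("</...>") or a self-closing tag ("<.../>").
    public static bool LooksLikeHtml(string input)
    {
        return TagRegex.Matches(input)
            .Cast<Match>() // MatchCollection is non-generic on older frameworks
            .Any(m => m.Value.StartsWith("</", StringComparison.Ordinal)
                   || m.Value.EndsWith("/>", StringComparison.Ordinal));
    }

    static void Main()
    {
        Console.WriteLine(LooksLikeHtml("<a href=\"www.google.com\">test</a>")); // has "</a>"
        Console.WriteLine(LooksLikeHtml("<test>"));                               // no closing or self-closing tag
    }
}
```

With this check, `<a href="www.google.com">test</a>` is accepted because of its `</a>`, while `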