I have a string, and I want to process it if it is valid XML; if it is not, I want to tell the user that the string is not valid XML.

My code is this:

try 
{

    XmlDocument doc = new XmlDocument();
    doc.LoadXml(rawData);

    //And here I want to do some things with doc if it is a valid XML.
}
catch (XmlException)
{
    //Tell the user that the string is not a valid XML.
}

If rawData contains valid XML, there is no problem. If rawData contains something else (like HELLOEVERYBODY!), LoadXml throws an exception, so I can tell the user that the string is not valid XML.

But when rawData contains an HTML page, the process takes a long time (more than 20 seconds!)...

It differs from page to page. For example, stackoverflow.com is processed quickly, but processing 1pezeshk.com takes a very long time...

Is there a faster way to validate XML before loading it into an XmlDocument?

Mahdi Ghiasi

1 Answer

I've seen this before. The problem is that XmlDocument tries to download the DTD referenced by the document. In your sample this is http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd, which lets you open a connection but never returns anything. So a simple solution (without any error checking, mind you) is to remove everything before the <html> tag, like this:

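WebClient wc = new WebClient();
wc.Encoding = Encoding.UTF8;
string data = wc.DownloadString("http://1pezeshk.com/");
// Drop everything before the <html> tag, including the DOCTYPE that triggers the DTD download.
int start = data.IndexOf("<html");
if (start > 0)
    data = data.Remove(0, start);
XmlDocument xml = new XmlDocument();
xml.LoadXml(data);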

Edit

Browsing to http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd actually returns the DTD, but it took well over a minute to respond. Since you won't be doing DTD validation anyway, you should really just strip this from your HTML and then try to validate it as HTML.
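
If you would rather not cut the string by hand, another option is to tell the parser to skip the DTD altogether. This is a different technique from the string stripping above, shown only as a minimal sketch: the XmlCheck class and TryLoadXml method are made-up names for illustration, and it assumes .NET 4.0 or later, where XmlReaderSettings.DtdProcessing is available. With DtdProcessing.Ignore the DOCTYPE is skipped and nothing should be fetched over the network.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the original question or answer.
public static class XmlCheck
{
    // Returns true and sets doc when rawData is well-formed XML; otherwise returns false.
    public static bool TryLoadXml(string rawData, out XmlDocument doc)
    {
        doc = null;
        if (rawData == null)
            return false;

        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore, // skip the DOCTYPE instead of processing it
            XmlResolver = null                    // never resolve external resources
        };

        try
        {
            using (var reader = XmlReader.Create(new StringReader(rawData), settings))
            {
                var candidate = new XmlDocument();
                candidate.XmlResolver = null;
                candidate.Load(reader);
                doc = candidate;
                return true;
            }
        }
        catch (XmlException)
        {
            // Not well-formed XML (plain text, most real-world HTML, and so on).
            return false;
        }
    }
}
```

You would call it as `XmlCheck.TryLoadXml(rawData, out doc)` and show your error message when it returns false. Keep in mind that once the DTD is ignored, HTML entities such as `&nbsp;` are undeclared, so pages that use them will probably be reported as invalid XML, and most real HTML pages are not well-formed XML in the first place.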

Karl-Johan Sjögren