4

I have some very simple code:

        XmlDocument doc = new XmlDocument();
        Console.WriteLine("loading");
        doc.Load(url);
        Console.WriteLine("loaded");

        XmlNodeList nodeList = doc.GetElementsByTagName("p");

        foreach(XmlNode node in nodeList)
        {
            Console.WriteLine(node.ChildNodes[0].Value);
        }
        return source;

I'm working on this file and it takes two minutes to load. Why does it take so long? I tried both fetching the file from the net and loading a local copy.

Justin
John

2 Answers

9

I imagine it's the DTD of the page that's taking so long to load. Given that it defines entities, you shouldn't disable it, so you're probably better off not going down this path.

Given the inner workings of the Wikipedia parser (a right mess), I'd say it's a big leap to assume it's going to produce well-formed XHTML every time.

Use HTML Agility Pack to parse (then you can convert to XmlDocument a little more easily if required, IIRC).
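For illustration, a minimal sketch of the HTML Agility Pack route. It assumes the HtmlAgilityPack NuGet package is referenced; the `ParagraphTexts` helper and the sample markup are made up for this example and are not from the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

static class ParagraphReader
{
    // Hypothetical helper: returns the text of every <p> element in the markup.
    public static List<string> ParagraphTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: no DTD is fetched, no well-formedness required

        var result = new List<string>();
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }
        foreach (HtmlNode p in paragraphs)
        {
            result.Add(p.InnerText);
        }
        return result;
    }
}

class Program
{
    static void Main()
    {
        // Sample input that is valid HTML but not well-formed XML
        // (unclosed <br>, unquoted attribute value).
        string html = "<html><body><p>First<br>paragraph</p><p class=x>Second</p></body></html>";
        foreach (string text in ParagraphReader.ParagraphTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

To read straight from a URL, `new HtmlWeb().Load(url)` returns the same kind of `HtmlDocument`.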

If you really want to go down the XmlDocument route you can keep a local cache of the HTML DTDs. See this post, this post and this post for details.
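One way to get a local copy of the DTDs without maintaining the files yourself is sketched below. It assumes .NET 4.0 or later, where `System.Xml.Resolvers.XmlPreloadedResolver` ships with the XHTML 1.0 DTDs and entity sets built in. The `OfflineXhtml.Load` helper and the sample document are made up for this example.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Resolvers;

static class OfflineXhtml
{
    // Hypothetical helper: loads XHTML while resolving the XHTML 1.0 DTDs
    // from the copies embedded in the framework instead of www.w3.org.
    public static XmlDocument Load(TextReader input)
    {
        var resolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10);
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // keep the DTD so entities stay defined
            XmlResolver = resolver
        };

        var doc = new XmlDocument();
        doc.XmlResolver = resolver; // used if the document itself needs to resolve anything
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            doc.Load(reader);
        }
        return doc;
    }
}

class Program
{
    static void Main()
    {
        // Sample document; &reg; is only defined through the DTD's entity sets.
        string xhtml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
            "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>" +
            "<body><p>Acme&reg;</p></body></html>";

        XmlDocument doc = OfflineXhtml.Load(new StringReader(xhtml));
        foreach (XmlNode node in doc.GetElementsByTagName("p"))
        {
            Console.WriteLine(node.InnerText);
        }
    }
}
```

For DTDs that are not preloaded, the same resolver accepts extra entries through its `Add` method, or you can subclass `XmlUrlResolver` and serve files from disk.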

spender
  • +1, beat me to the answer; if you download a copy and remove the DTD it parses right away, but then fails because &reg; is only defined in the DTD. – meklarian Apr 14 '11 at 01:00
  • The W3C throttles traffic to their DTD files, because they get buried in requests. You could use a custom entity resolver to load local copies of the DTD files. – Mads Hansen Apr 14 '11 at 01:26
5

It is because XmlDocument doesn't just load your XML into a nice class hierarchy; it also goes and fetches the DTD referenced by the document's DOCTYPE, plus the entity files that DTD pulls in. Run Fiddler and you will see the calls to fetch

http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent

These all took me about 20 seconds to fetch.
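If you don't have Fiddler handy, you can see the same requests from inside the program. This is a rough sketch: `TimingResolver` is a made-up name, and it only times how long each resource takes to open, not how long the body takes to read.

```csharp
using System;
using System.Diagnostics;
using System.Xml;

// Hypothetical resolver that logs every resource XmlDocument asks for.
class TimingResolver : XmlUrlResolver
{
    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        Stopwatch watch = Stopwatch.StartNew();
        try
        {
            // Defer to the normal file/HTTP handling.
            return base.GetEntity(absoluteUri, role, ofObjectToReturn);
        }
        finally
        {
            Console.WriteLine("{0} ms  {1}", watch.ElapsedMilliseconds, absoluteUri);
        }
    }
}

class Program
{
    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: pass the path or URL of the document to load");
            return;
        }

        var doc = new XmlDocument();
        doc.XmlResolver = new TimingResolver(); // must be set before Load
        doc.Load(args[0]);
        Console.WriteLine("loaded");
    }
}
```

Each line of output is one resource the resolver opened, including the document itself and any DTD or entity file it references.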

btlog