Extracting data from an XML document without using an XML parser

Question

Here are some lines of the document:

  <div class="rowleft">
    <h3>Technical Fouls</h3>

    <table class="num-left">
      <tr class="datahl2b">
        <td>&nbsp;</td>
        <td>Players</td>
      </tr>
      <tr>
        <td>DAL</td>
        <td>
          None</td>

      </tr>
      <tr>
        <td>MIA</td>
        <td>
          Mike Miller</td>
        <td>
          Mike Miller, Jr.</td>
      </tr>
    </table>
  </div>

I'm interested in extracting the values None, Mike Miller, and Mike Miller, Jr. from this. I tried using various XML parsers, but 1) the performance is abysmal and 2) the document is apparently not well-formed XML.

One thing I've been thinking about is stripping the document of newlines, splitting it at something like <tr>, seeing which lines contain data (probably using StartsWith()), and extracting it with a regex. That would be efficient enough for my program (it doesn't really matter if it takes half a second when downloading the document takes five seconds), but I'm interested in alternative solutions.

The performance is abysmal? Can you show the code you've been trying? — Tomalak, Jun 05 '11 at 14:32
@Tomalak: Sorry, it's long gone. But loading the document into a variable took 20+ seconds. — Hui, Jun 05 '11 at 14:38
@Hui that sounds like the download taking time, not the loading. What happens if you split them into two different lines? One downloading and one starting the parser — Oskar Kjellin, Jun 05 '11 at 14:47
@Oskar: Downloading it using `WebClient` takes 4-5 secs, so that's not the issue. — Hui, Jun 05 '11 at 14:51
@Tomalak: I don't keep track of all the code I've ever deleted. I tried that several hours ago. I remember using `XmlDocument` and `XmlTextReader`. — Hui, Jun 05 '11 at 14:53
@Hui Strange logic. How would you expect anyone to be able to tell you what's wrong when you don't keep the code that doesn't work? — Tomalak, Jun 05 '11 at 15:00
@Hui. Just for info, the reason that your snippet is not well-formed XML is that `&nbsp;` is not a predefined entity reference in XML. — Alohci, Jun 05 '11 at 16:25
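
The entity problem mentioned in the last comment can be checked directly. The following is a minimal sketch, assuming a .NET console project; the class and method names (`EntityCheck`, `IsWellFormedXml`) are hypothetical and not from the original thread. It feeds three small fragments to `XElement.Parse` and reports which ones the XML parser accepts. `&nbsp;` is an HTML entity that XML does not predefine, while the numeric reference `&#160;` and the predefined `&amp;` are both valid XML.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class EntityCheck
{
    // Hypothetical helper: true if the fragment parses as XML, false if the parser rejects it.
    public static bool IsWellFormedXml(string fragment)
    {
        try
        {
            XElement.Parse(fragment);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for undeclared entities, mismatched tags, and similar errors.
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsWellFormedXml("<td>&nbsp;</td>"));  // False: nbsp is not declared in XML
        Console.WriteLine(IsWellFormedXml("<td>&#160;</td>"));  // True: numeric character reference
        Console.WriteLine(IsWellFormedXml("<td>&amp;</td>"));   // True: predefined XML entity
    }
}
```

This only illustrates why a strict XML parser rejects the snippet. It does not explain the slow load times reported in the question.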

score 3 · Accepted Answer · edited May 23 '17 at 12:04

HTML generally isn't well-formed XML. I suggest you use something like the HTML Agility Pack.

answered Jun 05 '11 at 14:33 by Paul Creasey, edited May 23 '17 at 12:04 by Community

score 0 · Answer 2 · answered Jun 05 '11 at 14:33

Trying to parse HTML with string manipulation and regexes is invariably going to be horribly error-prone.

If your document is not well-formed XML, I would recommend using the HTML Agility Pack.
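
As a rough illustration of that recommendation, here is a minimal sketch, assuming the `HtmlAgilityPack` NuGet package is installed. The class `TechnicalFouls` and the helpers `Extract` and `Clean` are hypothetical names, and the XPath assumes the real page marks up the block exactly as in the question's snippet. It loads the HTML leniently, finds the `rowleft` block headed "Technical Fouls", and collects the player cells for each team row.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is referenced

static class TechnicalFouls
{
    // Hypothetical helper: maps each team code (first cell) to the player names in that row.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not well-formed XML

        var result = new Dictionary<string, List<string>>();

        // Select the rows of the block whose heading is "Technical Fouls".
        var rows = doc.DocumentNode.SelectNodes(
            "//div[@class='rowleft'][h3[normalize-space()='Technical Fouls']]//tr");
        if (rows == null) return result; // SelectNodes returns null when nothing matches

        foreach (var row in rows)
        {
            var cells = row.SelectNodes("./td");
            if (cells == null || cells.Count < 2) continue;

            string team = Clean(cells[0]);
            if (team.Length == 0) continue; // header row: its first cell holds only &nbsp;

            result[team] = cells.Skip(1).Select(c => Clean(c)).ToList();
        }
        return result;
    }

    // Decode entities such as &nbsp; and trim surrounding whitespace and newlines.
    private static string Clean(HtmlNode cell)
    {
        return HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }

    static void Main()
    {
        const string html = @"
<div class=""rowleft"">
  <h3>Technical Fouls</h3>
  <table class=""num-left"">
    <tr class=""datahl2b""><td>&nbsp;</td><td>Players</td></tr>
    <tr><td>DAL</td><td>
      None</td></tr>
    <tr><td>MIA</td><td>
      Mike Miller</td><td>
      Mike Miller, Jr.</td></tr>
  </table>
</div>";

        foreach (var entry in Extract(html))
        {
            Console.WriteLine(entry.Key + ": " + string.Join(" | ", entry.Value));
        }
    }
}
```

Reading each `<td>` as a separate node keeps a name like "Mike Miller, Jr." intact, which splitting the text on commas would not.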

answered Jun 05 '11 at 14:33 by Sven
