2

Here are some lines of the document:

  <div class="rowleft">
    <h3>Technical Fouls</h3>

    <table class="num-left">
      <tr class="datahl2b"> 
        <td>&nbsp;</td>
            <td>Players</td>
          </tr>
          <tr> 
            <td>DAL</td>
            <td>
              None</td>

          </tr>
          <tr> 
            <td>MIA</td>
            <td>
              Mike Miller</td>
            <td>
              Mike Miller, Jr.</td>
          </tr>
        </table>
    </div> 

I want to extract `None`, `Mike Miller`, and `Mike Miller, Jr.` from this. I tried various XML parsers, but 1) the performance is abysmal and 2) the document is apparently not well-formed XML.

One thing I've been thinking about is stripping the newlines from the document, splitting it at something like <tr>, checking which pieces contain data (probably using StartsWith()), and extracting the values with a regex. That would be efficient enough for my program (half a second of parsing hardly matters when downloading the document takes five seconds), but I'm interested in alternative solutions.

Hui
  • The performance is abysmal? Can you show the code you've been trying? – Tomalak Jun 05 '11 at 14:32
  • @Tomalak: Sorry, it's long gone. But loading the document into a variable took 20+ seconds. – Hui Jun 05 '11 at 14:38
  • @Hui that sounds like the download taking time, not the loading. What happens if you split them into two different lines? One downloading and one starting the parser – Oskar Kjellin Jun 05 '11 at 14:47
  • @Oskar: Downloading it using `WebClient` takes 4-5 secs, so that's not the issue. – Hui Jun 05 '11 at 14:51
  • @Hui *"it's long gone"*? How can that be? – Tomalak Jun 05 '11 at 14:52
  • @Tomalak: I don't keep track of all the code I've ever deleted. I tried that several hours ago. I remember using `XmlDocument` and `XmlTextReader`. – Hui Jun 05 '11 at 14:53
  • @Hui Strange logic. How would you expect anyone to be able to tell you what's wrong when you don't keep the code that doesn't work? – Tomalak Jun 05 '11 at 15:00
  • @Hui: Just for info, the reason your snippet is not well-formed XML is that `&nbsp;` is not a predefined entity reference in XML. – Alohci Jun 05 '11 at 16:25
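
To illustrate the last comment, here is a minimal sketch of how `XmlDocument` reacts to the `&nbsp;` in the snippet. The class and method names (`EntityCheck`, `IsWellFormedXml`) are made up for this example; only `XmlDocument.LoadXml` and `XmlException` come from the framework.

```csharp
using System.Xml;

public static class EntityCheck
{
    // Hypothetical helper: returns true if the fragment loads as XML,
    // false if XmlDocument rejects it with an XmlException.
    public static bool IsWellFormedXml(string fragment)
    {
        try
        {
            new XmlDocument().LoadXml(fragment);
            return true;
        }
        catch (XmlException)
        {
            // Raised, among other cases, for a reference to an entity
            // that has not been declared, such as &nbsp; without a DTD.
            return false;
        }
    }
}
```

Calling `EntityCheck.IsWellFormedXml("<td>&nbsp;</td>")` returns `false`, while the numeric form `"<td>&#160;</td>"` returns `true`, since XML predefines only `&lt;`, `&gt;`, `&amp;`, `&apos;` and `&quot;`.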

2 Answers

3

Relevant

HTML generally isn't well-formed XML; I suggest you use something like the HTML Agility Pack.
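
As a rough sketch of what that could look like for the table in the question (not tested against the real page): this assumes the `HtmlAgilityPack` NuGet package is referenced and that the table is the first one following the "Technical Fouls" heading. The class and method names (`TechnicalFoulParser`, `Extract`) are invented for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class TechnicalFoulParser
{
    // Hypothetical helper: maps each team code (first cell of a row)
    // to the text of the remaining cells in that row.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not well-formed XML

        var result = new Dictionary<string, List<string>>();

        // Assumption: the wanted table is the first <table> sibling
        // after the <h3> whose text is "Technical Fouls".
        var rows = doc.DocumentNode.SelectNodes(
            "//h3[normalize-space()='Technical Fouls']" +
            "/following-sibling::table[1]//tr");
        if (rows == null) return result; // SelectNodes returns null on no match

        foreach (var row in rows.Skip(1)) // skip the header row
        {
            // Decode entities such as &nbsp; and trim the surrounding whitespace.
            var cells = row.Elements("td")
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToList();
            if (cells.Count < 2) continue; // no player cells in this row

            result[cells[0]] = cells.Skip(1).ToList();
        }
        return result;
    }
}
```

For the snippet in the question, the intent is that `Extract(html)["MIA"]` holds `Mike Miller` and `Mike Miller, Jr.`, and `Extract(html)["DAL"]` holds `None`. Downloading stays separate from parsing, so the string returned by `WebClient.DownloadString` can be passed straight in and each step timed on its own.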

Paul Creasey
0

Trying to parse HTML with string manipulation and regexes is invariably going to be horribly error-prone.

If your document is not well-formed XML, I would recommend using the HTML Agility Pack.

Sven