-1

I have to match a large number of records in HTML. I want each record matched with a regular expression (using .NET's Regex.Match).

Each record is formatted like this (the full page consists of ordinary HTML plus ~100 records like the following):


<tr onclick="window.location.href='Vareauktion.asp?VISSER=Ja&funk=detaljedata&ID=14457'" style="cursor:hand" onmouseover="bgColor='#808080'" onmouseout="bgColor='#4b4b4b'" bgcolor="#4b4b4b"> 
                            <td valign="top"> 
                            <div id='OrdreID14457'></div> 
                <script>RunTimer('OrdreID14457', '04-10-2010 14:30:22');</script> 
                            <em><font size="-1">04-10-2010 14:30:22</font></em></td> 
                            <td valign="top"> Voldby (28|0)</td> 
                            <td valign="top">02:16:00</td> 
                            <td valign="top">09-10-2010<br>15:30:22</td> 
                            <td valign="top">Modeltog <img src="images/Gods_Modeltog.gif" alt="Modeltog" height="15" border="0"></td> 
                            <td valign="top">6603 T.</td> 
                            <td valign="top"> 
                            <img src='images/moneter.gif' height='13' alt='Moneter'>5.751.213 

                            </td> 
                            <td valign="top"> 

                            </td> 
                            <td valign="top"> 

                            </td> 
                        </tr>

I've tried the following so far:

Regex:

id='OrdreID.*[^(<td colspan="9" height="1" bgcolor="#000000">)]*<td colspan="9" height="1" bgcolor="#000000">

What I am trying to do is the following:

  • Start my match at: id='OrdreID
  • Accept everything afterwards, UNTIL it sees: <td colspan="9" etc.
  • Then at last, I match the final:

With my current solution, the problem is that the exclusion pattern (the [^...] part) only excludes individual characters, NOT whole strings.

I have been reading about lookahead, but I have no idea how to use it in this situation.

Thanks a lot!! Best regards, Lars

jball
Lars Holdgaard
  • [Friends don't let friends parse HTML with regular expressions.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Ether Oct 04 '10 at 16:30

2 Answers

2

I see you've tried a saw where a screwdriver is needed.

Have you tried using an HTML parser?
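As a rough sketch of what the parser route could look like, assuming the third-party HtmlAgilityPack NuGet package (the parser named in another answer here) is installed: the class name `AuctionRows` and the method name `GetOrderIds` are made up for illustration, and the XPath simply relies on the `id='OrdreID…'` divs shown in the question's sample record.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be referenced

public static class AuctionRows
{
    // Hypothetical helper: collects the numeric part of every id='OrdreID…' div.
    public static List<string> GetOrderIds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var ids = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection divs =
            doc.DocumentNode.SelectNodes("//div[starts-with(@id, 'OrdreID')]");
        if (divs == null)
        {
            return ids;
        }

        foreach (HtmlNode div in divs)
        {
            string id = div.GetAttributeValue("id", "");
            ids.Add(id.Substring("OrdreID".Length)); // strip the fixed prefix
        }
        return ids;
    }
}
```

From each matched div you can walk up to its enclosing `tr` (for example via `div.ParentNode`) and read the remaining cells, which avoids having to describe where a record ends in regex terms at all.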

maček
  • Ahh, that's a pretty damn good idea. I had never tried an HTML parser before, but after reading about it, you're spot on. Thanks! – Lars Holdgaard Oct 04 '10 at 16:45
0

Use the HtmlAgilityPack or a similar parser. If you must use Regex, and you don't care that much about robustness or maintainability, you could try something like:

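// Lazy .+? stops at the first separator cell; pass RegexOptions.Singleline so '.' also matches line breaks.
string pattern = "(?<=id='OrdreID).+?(?=<td colspan=\"9\" height=\"1\" bgcolor=\"#000000\">)";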
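
A slightly fuller sketch of the lookaround approach, for anyone who wants to see it run. It assumes the separator cell is literally `<td colspan="9" height="1" bgcolor="#000000">`, as in the question's attempted regex; the class name `RecordMatcher` and method name `ExtractRecords` are hypothetical. The lookbehind anchors the start, the lazy `.+?` keeps each match from running past the first separator, and the lookahead checks for the whole separator string rather than a set of characters, which is what the `[^...]` attempt could not do.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordMatcher
{
    // Assumed separator, taken from the regex attempted in the question.
    private const string Separator =
        "<td colspan=\"9\" height=\"1\" bgcolor=\"#000000\">";

    // Hypothetical helper: returns the text between each "id='OrdreID"
    // and the next separator cell.
    public static List<string> ExtractRecords(string html)
    {
        // (?<=...) lookbehind: match must start right after id='OrdreID
        // .+?      lazy: consume as little as possible
        // (?=...)  lookahead: stop just before the full separator string
        string pattern =
            "(?<=id='OrdreID).+?(?=" + Regex.Escape(Separator) + ")";

        var records = new List<string>();
        // Singleline makes '.' match '\n', since records span several lines.
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
        {
            records.Add(m.Value);
        }
        return records;
    }
}
```

Without `RegexOptions.Singleline` the pattern would find nothing in the sample record, because `.` stops at the first line break; with a greedy `.+` instead of `.+?`, all ~100 records would collapse into a single match.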
jball