I have the following page, and I want to use a regular expression to get the content between these tags:

<td colspan="2" align="left" valign="top" bgcolor="#FBFAF4"> ..... </td>

I am trying the following, but it returns an empty $matches array:

preg_match_all("/<td(.*) bgcolor=\"#FBFAF4\"\>(.*)\<\/td>/",$old_filecontents,$matches);

Which is the correct pattern for this?

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title>Exotiq - Ðñïúüíôá</title> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7"> <link href="Styles.css" rel="stylesheet" type="text/css"> <link href="stylesheets/Styles.css" rel="stylesheet" type="text/css"> <script src="scripts/PopBox.js" type="text/javascript"></script> <script type="text/javascript"> popBoxWaitImage.src = "images/spinner40.gif"; popBoxRevertImage = "images/magminus.gif"; popBoxPopImage = "images/magplus.gif"; </script> <script type="text/javascript"> AC_FL_RunContent('codebase', 'http://download.macromedia.com/pub/shockwave/ cabs/flash/swflash.cab#version=9,0,28,0', 'width','675','height','445','title','Morpork', 'src','assets/flash/morepork','loop', 'false','quality','high','pluginspage', 'http://www.adobe.com/shockwave/download/download.cgi?P1_Prod_Version=ShockwaveFlash', 'wmode','transparent','movie','assets/flash/morepork'); </script> </head> <body background="images/fonto2.jpg" topmargin="0"> <table width="948" border="0" align="center" cellpadding="0" cellspacing="0"> <tr> <td><table width="948" border="0" align="center" cellpadding="0" cellspacing="0"> <tr> <td width="24">&nbsp;</td> <td height="150" colspan="3"><object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,29,0" width="900" height="150"> <param name="movie" value="flash/top02.swf"> <param name="quality" value="high"> <param name="wmode" value="transparent"> <embed src="flash/top02.swf" quality="high" pluginspage="http://www.macromedia.com/go/getflashplayer" type="application/x-shockwave-flash" width="900" height="150"></embed></object></td> <td width="24" height="150">&nbsp;</td> </tr> <tr> <td height="31" colspan="5" valign="middle"> <div align="center"> <script src="menu/xaramenu.js"></script> <script Webstyle4 src="menu/menu_.js"></script> </div></td> </tr> <tr> <td 
width="24">&nbsp;</td> <td width="200" valign="top" background="images/GreenFasa.jpg"> <br> <table width="180" border="0" align="center" cellpadding="0" cellspacing="1"> <tr> <td height="25" class="styles"> &nbsp;<a href="MakutiUmbrela.html" class="styles">Makuti</a><br> <hr> </td> </tr> <tr> <td height="25" class="styles"> &nbsp;<a href="FunPalmUmbrela.html" class="styles">Fun Palm</a><br> <hr> </td> </tr> <tr> <td height="25" class="styles"> &nbsp;<a href="AlangUmbrela.html" class="styles">Alang-Alang</a><br> <hr> </td> </tr> <tr> <td height="25" class="styles"> &nbsp;<a href="ThatchUmbrela.html" class="styles">Thatch</a><br> <hr> </td> </tr> <tr> <td height="25" class="styles"> &nbsp;<a href="AbacaUmbrela.html" class="styles"><strong>Abaca</strong></a><br> <hr> </td> </tr> <tr> <td height="25" class="styles">&nbsp; </td> </tr> </table></td> <td colspan="2" align="left" valign="top" bgcolor="#FBFAF4"> <div align="left"> <table width="680" border="0" align="center" cellpadding="0" cellspacing="0"> <tr> <td width="600" height="40" class="titles">ÊáôáóêåõÝò - ÏìðñÝëåò - Abaca</td> <td width="50" align="right" valign="middle" class="titles"> <div align="right"><a href="/AbacaUmbrela_en.html"><img src="images/uk-flag.jpg" width="30" height="17" border="0"></a></div></td> </tr> <tr> <td colspan="2" class="body"><p>Ç ïìðñÝëá <strong>Abaca</strong> Ýñ÷åôáé ùò Üîéïò áíôéêáôáóôÜôçò ôçò ïìðñÝëáò Rattan ðïõ åðß 15 ÷ñüíéá óôïëßæåé ôéò åëëçíéêÝò ðáñáëßåò. Ôï <strong>Abaca</strong> åßíáé Ýíá öõóéêü õëéêü ðéï <strong>áíèåêôéêü</strong> êáé ðéï üìïñöï áðü ôï Rattan. 
<br> Ðáñáäßäåôáé ìå <strong>îýëéíï êïñìü åìðïôéóìïý</strong> Ö8åê.<br> <br> </p> <table width="680" border="0" cellspacing="0" cellpadding="0"> <tr> <td width="340" height="150" valign="middle"> <div align="left"><img src="images/Manufactures/Umbrelas/Abaca/AbacaUmbrela.jpg" width="328" height="500"></div></td> <td width="340" height="150" valign="bottom" class="body"> <table width="340" border="0" cellspacing="0" cellpadding="0"> <tr> <td width="170" height="130"> <div align="center"><img src="images/Manufactures/Umbrelas/Abaca/1_Abaca02_s.jpg" width="152" height="101" class="PopBoxImageSmall" onclick="Pop (this,50,'PopBoxImageLarge');" title="ÌåãÝèõíóç" pbsrc="images/Manufactures/Umbrelas/Abaca/1_Abaca02.jpg" pbCaption="Abaca - ÏìðñÝëá ðáñáëßáò" popBoxCaptionBelow="true" /></div></td> <td width="170" height="130"> <div align="center"><img src="images/Manufactures/Umbrelas/Abaca/2_Abaca03_s.jpg" width="150" height="112" class="PopBoxImageSmall" onclick="Pop (this,50,'PopBoxImageLarge');" title="ÌåãÝèõíóç" pbsrc="images/Manufactures/Umbrelas/Abaca/2_Abaca03.jpg" pbCaption="Abaca - ÏìðñÝëá ðáñáëßáò" popBoxCaptionBelow="true" /></div></td> </tr> <tr> <td width="170" height="130"> <div align="center"><img src="images/Manufactures/Umbrelas/Abaca/3_Abaca01_s.jpg" width="150" height="112" class="PopBoxImageSmall" onclick="Pop (this,50,'PopBoxImageLarge');" title="ÌåãÝèõíóç" pbsrc="images/Manufactures/Umbrelas/Abaca/3_Abaca01.jpg" pbCaption="Abaca - ÏìðñÝëá ðáñáëßáò" popBoxCaptionBelow="true" /></div></td> <td width="170" height="130"> <div align="center"></div></td> </tr> <tr> <td width="170" height="130"> <div align="center"></div></td> <td width="170" height="130"> <div align="center"></div></td> </tr> <tr> <td width="170" height="130"> <div align="center"></div></td> <td width="170" height="130"> <div align="center"></div></td> </tr> </table></td> </tr> <tr> <td width="340" height="50" valign="top"> <p align="center">&nbsp;</p></td> <td width="340" height="50" 
valign="top"> <div align="center" class="perigrafes">ÊëéêÜñåôáé ðÜíù óôéò öùôïãñáößåò ãéá ìåãÝèõíóç</div></td> </tr> <tr> <td width="340" valign="bottom"> <div align="center"> </div></td> <td width="340" valign="bottom"> <p align="center">&nbsp; </p></td> </tr> <tr> <td width="340" valign="top"> <div align="center"></div></td> <td width="340" valign="top"> <p align="center">&nbsp;</p></td> </tr> <tr> <td height="20" colspan="2" valign="top">&nbsp;</td> </tr> </table></td> </tr> </table> <font color="#FFFFFF"></font></div></td> <td width="24" height="420">&nbsp;</td> </tr> <tr> <td width="24">&nbsp;</td> <td width="200">&nbsp;</td> <td width="600">&nbsp;</td> <td width="100">&nbsp;</td> <td width="24">&nbsp;</td> </tr> </table></td> </tr> <tr> <td height="22"><table width="900" border="0" align="center" cellpadding="0" cellspacing="0" bgcolor="#007F3E"> <tr> <td height="25"> <div align="center" class="styles">All rights reserved &reg; Designed by CONTINENTAL ADVERTISING </div></td> </tr> </table></td> </tr> </table> <script type="text/javascript"> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> try { var pageTracker = _gat._getTracker("UA-12742174-1"); pageTracker._trackPageview(); } catch(err) {}</script> </body> </html>
billaraw
  • What do you mean by "Here it comes"? – billaraw Mar 09 '11 at 15:07
  • Here this comes: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags Basically, don't use regex to parse HTML. There are plenty of good parsers out there that you should use instead. – Yacoby Mar 09 '11 at 15:07
  • Another day, another interesting way to use regular expressions ... – Jeff Parker Mar 09 '11 at 15:08
  • Really beautiful HTML code :) – krtek Mar 09 '11 at 15:09
  • You escape some angle brackets and not others. If you fix that, the regex works on the first example, but on the whole page it returns some strange results; see here: http://regexp-evaluator.de/evaluator/2e53b8967d310cf6d73546d1edbd283d/#ergebnis – Flo Mar 09 '11 at 15:20

1 Answer

Given that the cell you're after contains HTML (another table, in fact), you can't simply match up to the next closing tag: you'd get only the content between the cell's opening tag and the first </td> inside it. Also, '.' doesn't match newlines by default, so unless the cell opens and closes on the same line (or you add the s modifier), you'll get no matches.
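Both failure modes can be reproduced on a small sample. The sketch below is only an illustration: the $html string is a made-up stand-in for the real page, and the pattern is a simplified variant of the one in the question.

```php
<?php
// Hypothetical sample: the target cell spans several lines and contains a nested table.
$html = "<td colspan=\"2\" bgcolor=\"#FBFAF4\">\n"
      . "<table><tr><td>inner</td></tr></table>\n"
      . "tail</td>";

// Without the `s` modifier, `.` does not match "\n", so the pattern cannot
// get past the first line break and finds nothing.
$withoutS = preg_match('/<td[^>]* bgcolor="#FBFAF4">(.*?)<\/td>/', $html, $m1);

// With `s`, the lazy match succeeds, but it stops at the first </td>,
// which closes the nested cell, so the captured content is cut short.
$withS = preg_match('/<td[^>]* bgcolor="#FBFAF4">(.*?)<\/td>/s', $html, $m2);

var_dump($withoutS); // number of matches without `s`
var_dump($m2[1]);    // truncated capture with `s`
```

Making the quantifier greedy does not help either: it then runs to the last </td> in the input, which on a full page belongs to some other cell.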

I'd say don't use regular expressions for this. Try an XML parser.

If you were just extracting plain text, a regex would be fine, but because the content you want is HTML that contains your terminator, you'll need a parser with some awareness of DOM depth ... or a way to count terminators in the regex.
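As a minimal sketch of the parser route, assuming PHP's bundled DOM extension is available, the cell can be selected by its bgcolor attribute with XPath. The sample markup is a made-up stand-in for $old_filecontents from the question, and a real page in a legacy charset may need extra encoding handling that is not shown here.

```php
<?php
// Hypothetical stand-in for $old_filecontents from the question.
$old_filecontents = '<html><body><table><tr>'
    . '<td colspan="2" align="left" valign="top" bgcolor="#FBFAF4">'
    . '<table><tr><td>inner</td></tr></table>tail</td>'
    . '</tr></table></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
$doc->loadHTML($old_filecontents);
libxml_clear_errors();

// Select every <td> whose bgcolor attribute is exactly "#FBFAF4".
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//td[@bgcolor="#FBFAF4"]');

$matches = [];
foreach ($cells as $cell) {
    // Serialize the cell's children to get its inner HTML, nested table included.
    $inner = '';
    foreach ($cell->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $matches[] = $inner;
}

print_r($matches);
```

Because the parser tracks nesting, the nested table's own </td> tags do not end the match early.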

Jeff Parker
  • Yeah, if there's another table in between, you might not find the correct </td>. But there is still the possibility of just exploding the whole thing – Flo Mar 09 '11 at 15:17
  • Eventually I used this http://simplehtmldom.sourceforge.net/ and got the job done really fast. – billaraw Mar 09 '11 at 22:49