1

After using cURL to fetch an external page, I have the full source code, which contains something like this (the part I'm interested in):

   (page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)

I'm using preg_match_all, and I want to get only "buy_tickets.gif":

$pattern_before = "<td valign='top' class='rdBot' align='center'>";
$pattern_after = "</td>";
$pattern = '#'.$pattern_before.'(.*?)'.$pattern_after.'#si';

preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);

Everything is fine up to now, but the problem is that the external page sometimes changes, and the image I'm looking for ends up inside a link:

(page...)<td valign='top' class='rdBot' align='center'><a href="blaa" title="ble"><img src="/images/buy_tickets.gif" border="0" alt="T"></a></td> (page...)

and I don't know how to make my code work in both cases (not just when the image has no link).

I hope that makes sense.

Thanks in advance.

lonesomeday
Zuker

5 Answers

5

Don't use regex to parse HTML; use PHP's DOM extension. Try this:

$doc = new DOMDocument;

@$doc->loadHTMLFile( 'http://ventas.entradasmonumental.com/eventperformances.asp?evt=18' ); // Using the @ operator to hide parse errors

$xpath  = new DOMXPath( $doc );

$img = $xpath->query( '//td[@class="BrdBot"][@align="center"][1]//img[1]')->item( 0 ); // Xpath->query returns a 'DOMNodeList', get the first item which is a 'DOMElement' (or null)

$imgSrc = $img->getAttribute( 'src' );

$imgSrcInfo = pathInfo( $imgSrc );

$imgFilename = $imgSrcInfo['basename']; // All you need
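
The fatal error reported in the comments below comes from calling getAttribute() when the query matched nothing. A minimal sketch of a guarded version follows, assuming the HTML has already been fetched into a string. The helper name findCellImageFilename and the sample markup are hypothetical, and the class name rdBot is taken from the question's snippet, so adjust it to the real page. Because //img is a descendant search, the same query covers both the bare image and the image wrapped in a link.

```php
<?php
// Hypothetical helper: returns the file name of the first <img> found inside
// a <td> with the given class, or null when nothing matches.
function findCellImageFilename(string $html, string $class): ?string
{
    $doc = new DOMDocument();
    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // "//img" searches all descendants, so an <a> wrapper makes no difference.
    // $class is assumed to be a simple token without quote characters.
    $query = sprintf('(//td[@class="%s"]//img)[1]', $class);
    $img = $xpath->query($query)->item(0);
    if (!$img instanceof DOMElement) {
        return null; // nothing matched: avoid calling methods on null
    }
    return basename($img->getAttribute('src'));
}

// The two markup variants from the question, wrapped in a table for parsing.
$plain = <<<'HTML'
<table><tr><td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td></tr></table>
HTML;
$linked = <<<'HTML'
<table><tr><td valign='top' class='rdBot' align='center'><a href="blaa" title="ble"><img src="/images/buy_tickets.gif" border="0" alt="T"></a></td></tr></table>
HTML;

echo findCellImageFilename($plain, 'rdBot'), "\n";
echo findCellImageFilename($linked, 'rdBot'), "\n";
```

Returning null instead of throwing leaves it to the caller to decide what a missing image means, for example that the page layout has changed again.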
Samuel Katz
  • @SalmanPK -- before recommending w3Schools, I suggest reading http://w3fools.com/. In brief, a lot of people consider them to be a poor quality resource. You could probably do better by linking to the PHP manual instead. – Spudley May 28 '11 at 17:00
  • @Spudley Thanks for pointing that out :) Here is a good link about XPath syntax: http://msdn.microsoft.com/en-us/library/ms256086.aspx – Samuel Katz May 28 '11 at 17:08
  • @SalmanPK I've tried this but I can't make it work. The image is inside a link inside a td with the class "rdbot", but it's not the only td with that class – Zuker May 28 '11 at 17:18
  • @Zuker, I have updated the code. Please check and let me know if there is any problem :) It would help if you could provide the URL of the page you're trying to scrape. – Samuel Katz May 28 '11 at 18:33
  • @salmanPK this is the page: http://ventas.entradasmonumental.com/eventperformances.asp?evt=18. I want to catch this image: "http://ventas.entradasmonumental.com/images/buy_tickets.gif". Using your code I'm getting Warning: DOMDocument::loadHTMLFile(): htmlParseStartTag: misplaced tag in ... Big thanks! – Zuker May 28 '11 at 18:41
  • @SalmanPK I'm getting an error, any ideas? Fatal error: Call to a member function getAttribute() on a non-object in /var/www/turiver.com/public_html/matias/entradas.php on line 18. Line 18 is $imgSrc = $img->getAttribute('src'); – Zuker May 30 '11 at 18:11
  • That's happening because the XPath query is unable to find any element(s). Has the page changed? – Samuel Katz May 30 '11 at 18:17
  • Updated the XPath query. BTW You can test your XPath queries using Firebug. ;) – Samuel Katz May 30 '11 at 18:24
  • Yes, the page changes, but it's always the same thing: the image we are trying to get is sometimes part of a link and sometimes not. Can it be fixed so it works in both cases? – Zuker May 31 '11 at 14:58
1

You're going to get lots of advice not to use regex for pulling stuff out of HTML code.

There are times when it's appropriate to use regex for this kind of thing, and I don't always agree with the somewhat rigid advice given on the subject here (and elsewhere). However, in this case, I would say that regex is not the appropriate solution for you.

The problem with using regex for searching for things in HTML code is exactly the problem you've encountered -- HTML code can vary wildly, making any regex virtually impossible to get right.

It is just about possible to write a regex for your situation, but it will be an insanely complex regex, and very brittle, i.e. prone to failing if the HTML code is even slightly outside the parameters you expect.

Contrast this with the recommended solution, which is to use a DOM parser. Load the HTML code into a DOM parser, and you will immediately have an object structure which you can query for individual elements and attributes.

The details you've given make it almost a no-brainer to go with this rather than a regex.

PHP has a built-in DOM parser, which you can call as follows:

$mydom = new DOMDocument;
$mydom->loadHTMLFile("http://....");

You can then use XPath to search the DOM for your specific element or attribute that you want:

$myxpath = new DOMXPath($mydom);
// XPath positions start at 1, and attributes are selected with /@name
$myattr = $myxpath->query("//td[@class='rdBot']//img[1]/@src");
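
As a rough sketch of how the attribute query above can be used, the snippet below parses a small HTML string (a made-up stand-in for the fetched page) and lists the src of every image inside a matching cell. libxml_use_internal_errors(true) is used to keep parser warnings on imperfect markup from being printed. The sample markup, including the logo.gif and sold_out.gif file names, is an assumption for illustration only.

```php
<?php
// Sample markup standing in for the fetched page (an assumption).
$html = <<<'HTML'
<table>
  <tr><td class="other"><img src="/images/logo.gif"></td></tr>
  <tr><td class="rdBot"><a href="blaa"><img src="/images/buy_tickets.gif"></a></td></tr>
  <tr><td class="rdBot"><img src="/images/sold_out.gif"></td></tr>
</table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$mydom = new DOMDocument();
$mydom->loadHTML($html);
libxml_clear_errors();

$myxpath = new DOMXPath($mydom);

// Selecting "/@src" yields attribute nodes; nodeValue holds the attribute text.
$sources = [];
foreach ($myxpath->query("//td[@class='rdBot']//img/@src") as $attr) {
    $sources[] = $attr->nodeValue;
}

print_r($sources);
```

When several cells share the class, as the asker mentions, iterating over the node list like this makes it possible to pick the wanted image by its src rather than by position.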

Hope that helps.

Spudley
  • Thanks Spudley. I've never used this; I'm trying it and I'm getting "Warning: DOMDocument::loadHTMLFile(): htmlParseStartTag: misplaced tag in ...". The image is inside a link inside a td with the class "rdbot", but it's not the only td with that class – Zuker May 28 '11 at 17:23
0

Do you only need images inside of the "td" tags?

$regex='/<img src="\/images\/([^"]*)"[^>]*>/im';

edit:

To grab the specific image, this should work:

$regex='/<td valign=\'top\' class=\'rdBot\' align=\'center\'>.*src="\/images\/([^"]*)".*<\/td>/';
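
If the regex route is taken anyway, one way to tolerate the optional link is to make the opening a tag optional and to keep every wildcard from crossing a tag boundary. This is only a sketch against the two snippets quoted in the question; the variable names are made up, and it remains subject to the usual brittleness (attribute order, quoting, whitespace).

```php
<?php
// Nowdoc keeps the pattern free of PHP string escaping.
// (?:<a\b[^>]*>\s*)? allows, but does not require, a link around the image.
$regex = <<<'REGEX'
#<td valign='top' class='rdBot' align='center'>\s*(?:<a\b[^>]*>\s*)?<img\b[^>]*\bsrc="/images/([^"]*)"#i
REGEX;

// Both markup variants from the question.
$buffer = <<<'HTML'
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td>
(page...)<td valign='top' class='rdBot' align='center'><a href="blaa" title="ble"><img src="/images/buy_tickets.gif" border="0" alt="T"></a></td>
HTML;

preg_match_all($regex, $buffer, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER, index 1 of each match holds the captured file name.
$filenames = array_column($matches, 1);
print_r($filenames);
```

Using [^>]* instead of .* stops the match from running past the end of the current tag, which is the greedy-match problem raised in the comments.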
Trey
  • I need only the image listed there, not ALL the images. And sometimes the image is inside a link – Zuker May 28 '11 at 17:24
  • Trey, is this for preg_match_all? Like this? $regexp = '/.*src="\/images\/([^"]*)".*<\/td>/'; preg_match_all($regexp, $buffer, $matches, PREG_SET_ORDER) – Zuker May 28 '11 at 18:11
  • yes, actually it should have an 'im' at the end of the regex : `$regex='/.*src="\/images\/([^"]*)".*<\/td>/im` – Trey May 28 '11 at 18:16
  • Hmm, I can't make it work. I'm trying your page and I'm not getting any matches – Zuker May 28 '11 at 18:26
  • These regexes are very brittle. First regex: what if there are 2 spaces between img and src? What if there is another attribute between img and src? What if there is a space between src and "=" or between = and "? What if instead of src=" you have src='? These are just some of the most common cases that will break your regex. For the second one, here is a rundown: inverted attributes, white space, greedy .* before src (this will match anywhere after the first td), no allowance for " vs '... This kind of regex is the reason why people recommend going with a parser: it *will* break. – Sylverdrag May 28 '11 at 19:04
  • @Sylverdrag Yup, you are absolutely right, but in this particular instance he seems to know exactly what he wants to grab... he already knows the attributes and exact markup of the wrapping tag, so I see no reason to complicate it... I've personally had just as many problems using parsers, because they tend to break when a DOM tree is improperly nested, missing markup, etc. – Trey May 28 '11 at 19:28
  • @Zuker in the test page don't add the "im", there are checkboxes on the left to check, the "i" is a flag for case insensitivity, and the "m" is a flag for "multiline" – Trey May 28 '11 at 19:29
  • @trey: Since he doesn't know whether there are tags between the td and the image, I doubt he knows exactly what the markup will be down to the spaces and the type of quotation marks. I understand your point about parsers. Poorly formatted HTML is a pain, but if he chooses the regex route against invalid HTML markup, he really needs to proof his expressions against all the common things that could break them. – Sylverdrag May 28 '11 at 20:00
0
function GetFilename($file) {
    $filename = substr($file, strrpos($file,'/')+1,strlen($file)-strrpos($file,'/'));
    return $filename;
}
echo GetFilename('/images/buy_tickets.gif');

This will output buy_tickets.gif
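
PHP's built-in basename() gives the same result for a plain path. A small sketch follows; the example.com URL with a query string is hypothetical and is there only to show why parse_url() may be worth applying first when the src is a full URL.

```php
<?php
// Plain path: basename() returns the last path component.
$fromPath = basename('/images/buy_tickets.gif');

// Hypothetical full URL with a query string: isolate the path before basename().
$url = 'http://example.com/images/buy_tickets.gif?v=2';
$fromUrl = basename(parse_url($url, PHP_URL_PATH));

echo $fromPath, "\n", $fromUrl, "\n";
```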

Sujit Agarwal
0

Parsing HTML with Regex is not recommended, as has been mentioned by several posters.

However, if the path of your images always follows the pattern src="/images/name.gif", you can easily extract it in Regex:

$pattern = <<<'EOD'
#src\s*=\s*['"]/images/(.*?)["']#
EOD;

If you are sure that the images always follow the path "/images/name.ext" and you don't care where the image link is located in the page, this will do the job. If you have more detailed requirements (such as matching only within a specific class), forget regex; it's not the right tool for the job.
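
As an illustration of how that pattern might be applied, here is a short sketch with preg_match_all. The sample buffer is made up, including the logo.png and skip.gif file names; it only shows that the pattern accepts either quote style and optional spaces around the equals sign, and ignores sources outside /images/.

```php
<?php
// Nowdoc so that quotes and backslashes reach the regex engine unchanged.
$pattern = <<<'EOD'
#src\s*=\s*['"]/images/(.*?)["']#
EOD;

// Hypothetical page fragment with three images.
$buffer = <<<'HTML'
<td class='rdBot'><a href="blaa"><img src="/images/buy_tickets.gif" border="0" alt="T"></a></td>
<td><img src = '/images/logo.png'></td>
<td><img src="/other/skip.gif"></td>
HTML;

preg_match_all($pattern, $buffer, $matches);

// Without PREG_SET_ORDER, $matches[1] lists every capture of group 1.
$names = $matches[1];
print_r($names);
```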


I just read in your comments that you need to match within a specific tag. Use a parser, it will save you untold headaches.

If you still want to go through regex, try this:

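(?i)(?<=<td .*?class\s*=\s*['"]rdBot['"][^<>]*?>.*?)(?<!</td>.*)<img [^<>]*src\s*=\s*["']/images/(.*?)["']

This works in C#, where a lookbehind may contain variable-length patterns. PHP's preg functions use PCRE, which rejects an unbounded lookbehind such as the .*? above, so the pattern will not compile there as written.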
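
A PHP-friendly way to get a similar effect without lookbehind is to split the work into two passes: first isolate each cell with the wanted class, then look for the image inside it. The sketch below assumes that cells are not nested; the helper name imagesInCells and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper implementing the two-pass idea.
function imagesInCells(string $html, string $class): array
{
    $found = [];
    // Pass 1: capture the contents of every <td> whose class attribute is $class.
    // The "i" flag also makes the class comparison case-insensitive.
    $cellPattern = '#<td\b[^>]*\bclass\s*=\s*[\'"]' . preg_quote($class, '#')
        . '[\'"][^>]*>(.*?)</td>#is';
    preg_match_all($cellPattern, $html, $cells);

    foreach ($cells[1] as $cell) {
        // Pass 2: within one cell, pull the file name out of an /images/ source.
        if (preg_match('#<img\b[^>]*\bsrc\s*=\s*[\'"]/images/([^\'"]*)[\'"]#i', $cell, $m)) {
            $found[] = $m[1];
        }
    }
    return $found;
}

// Made-up fragment: one matching cell (image inside a link) and one other cell.
$html = <<<'HTML'
<td valign='top' class='rdBot' align='center'><a href="blaa" title="ble"><img src="/images/buy_tickets.gif" border="0" alt="T"></a></td>
<td class="other"><img src="/images/logo.gif"></td>
HTML;

$result = imagesInCells($html, 'rdBot');
print_r($result);
```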

Sylverdrag