This is my HTML:

<div class="panel-image listing-img">
      <a href="/rooms/854260?s=BD20" class="media-photo media-cover">
        <div class="listing-img-container media-cover text-center">
          <img itemprop="image" data-current="0" src="https://a2.muscache.com/ic/pictures/19208233/4d8e6c0d_original.jpg?interpolation=lanczos-none&amp;size=x_medium&amp;output-format=jpg&amp;output-quality=70"
          class="img-responsive-height" alt="Cozy room - Prague centre Old Town" data-urls="[output-format=jpg&amp;output-quality=70&quot;, &quot;https://a1.mu &quotut-format=jpg&amp;output-quality=70&quot;]">
        </div>

I want to get only the first part of the src, https://a2.muscache.com/ic/pictures/19208233/4d8e6c0d_original.jpg, using a regular expression. So far I have tried:

class=\"listing-img-container media-cover text-center\">\n(.*)

but it captures the whole long link.

  • This is almost what I am looking for, but my file contains more unneeded SRCs. Could I somehow scrape only the SRC of this specific DIV? Something like class=\"listing-img-container media-cover text-center\">\n \\src="[^"]+" perhaps? – Ilan Bylakov Nov 24 '14 at 11:47
  • Added the modified regex. – vks Nov 24 '14 at 11:51
  • Regex is not the most appropriate tool for parsing HTML. Have you considered using a structured DOM? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Sam Greenhalgh Nov 24 '14 at 12:06
  • What language are you using? – Sam Greenhalgh Nov 24 '14 at 12:09
  • Well, I am scraping information from divs with regular expressions and PHP. DOM was too complex for me to work with, so I am doing it with PHP and regex_scrape. – Ilan Bylakov Nov 24 '14 at 12:25

2 Answers

<div class="listing-img-container media-cover text-center">[\s\S]*?src="([^"]+?\.jpg)

Try this. Grab the first capture group. See the demo:

http://regex101.com/r/zU7dA5/19
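As a rough sketch of how the pattern above could be applied in PHP, the snippet below runs it through preg_match and reads capture group 1. The $html sample is a shortened stand-in for the real page, and the example.com logo image is made up purely to show that an earlier, unrelated src is skipped; both are assumptions, not part of the original answer.

```php
<?php
// Shortened sample markup (assumption); in practice $html would hold the fetched page.
$html = <<<'HTML'
<div class="panel-image listing-img">
  <img src="https://example.com/logo.png">
  <div class="listing-img-container media-cover text-center">
    <img itemprop="image" src="https://a2.muscache.com/ic/pictures/19208233/4d8e6c0d_original.jpg?interpolation=lanczos-none&amp;size=x_medium" class="img-responsive-height">
  </div>
</div>
HTML;

// '~' is used as the delimiter so that '/' in the markup needs no escaping.
// [\s\S]*? lazily skips anything (including newlines) up to the first src="
// after the target div; the group then stops at the first ".jpg".
$pattern = '~<div class="listing-img-container media-cover text-center">[\s\S]*?src="([^"]+?\.jpg)~';

$src = null;
if (preg_match($pattern, $html, $matches) === 1) {
    $src = $matches[1]; // capture group 1: the URL up to and including .jpg
}

echo $src, PHP_EOL;
```

Because the pattern is anchored on the exact class attribute, it only works while the div is written with precisely that attribute order and spacing.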

vks
  • Almost there, I appreciate the help! Can we make it match only up to the .jpg, so that https://a2.muscache.com/ic/pictures/19208233/4d8e6c0d_original.jpg?interpolation=lanczos-none&size=x_medium&output-format=jpg&output-quality=70 becomes https://a2.muscache.com/ic/pictures/19208233/4d8e6c0d_original.jpg? And exclude PNG? – Ilan Bylakov Nov 24 '14 at 11:59
  • Please have a look: http://regex101.com/r/eE7pM4/1 It now scrapes the second SRC it sees, even though the DIV is set to the correct one. What could be the issue? – Ilan Bylakov Nov 24 '14 at 12:18
  • @IlanBylakov The first one you wanted had `png`, not `jpg`, so it was catching the second one. I have modified the regex. See http://regex101.com/r/eE7pM4/2 – vks Nov 24 '14 at 14:23

Don't use regex; use a DOM parser such as DOMDocument together with DOMXPath. For XPath, also have a look here.

Load all your HTML into a DOMDocument and search it with XPath:

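$dom = new DOMDocument();
// @ suppresses the warnings raised while parsing the malformed markup
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the <img> that is a direct child of the target container div
$imageNodes = $xpath->query('//div[@class="listing-img-container media-cover text-center"]/img');
// item(0) is null if nothing matched, so check $imageNodes->length first on untrusted input
$src = $imageNodes->item(0)->getAttribute('src');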

I suppressed warnings for $dom->loadHTML() with the @ operator because the incorrect HTML triggers some, but this does not affect functionality.
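As an alternative to the @ operator, the parser warnings can be collected through libxml's internal error buffer. This is a minimal sketch of that option, not what the answer itself does; the deliberately malformed $html sample (a stray closing tag) is an assumption made for illustration.

```php
<?php
// Deliberately malformed sample (stray </p>); an assumption for illustration only.
$html = '<div class="listing-img-container media-cover text-center"><img src="a.jpg"></p></div>';

// Buffer libxml errors instead of emitting PHP warnings; remember the old setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Inspect (or log) what the parser complained about, then empty the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error handling was active before.
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
}
```

Unlike @, this keeps the messages available for logging while still leaving the page output clean.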

If you don't want the whole src but just the part before the ?, add:

$explode = explode('?', $src, 2);
$src = $explode[0];
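Putting the steps together, a small helper might look like the sketch below. The function name firstListingImageUrl is hypothetical (it does not appear in the answer), the sample $html is shortened from the question, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper, not part of the original answer: returns the listing
// image URL without its query string, or null when no matching <img> exists.
function firstListingImageUrl(string $html): ?string
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from malformed markup, as in the answer

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//div[@class="listing-img-container media-cover text-center"]/img');
    if ($nodes === false || $nodes->length === 0) {
        return null; // the query failed or matched nothing
    }

    $src = $nodes->item(0)->getAttribute('src');

    // Keep only the part before the first '?'.
    return explode('?', $src, 2)[0];
}

// Sample markup shortened from the question (assumption).
$html = '<div class="listing-img-container media-cover text-center">'
      . '<img src="https://a2.muscache.com/ic/pictures/19208233/4d8e6c0d_original.jpg?interpolation=lanczos-none&amp;size=x_medium">'
      . '</div>';

var_dump(firstListingImageUrl($html));
var_dump(firstListingImageUrl('<p>no image here</p>'));
```

Returning null for the no-match case avoids the fatal error that calling getAttribute() on a missing node would otherwise cause.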
SBH