Here is my regex to scrape image URLs from a page:

preg_match_all('/\bhttps?:\/\/\S+(?:png|jpg)\b/', $html, $matches);

But it fails when the image URL looks like this:

src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Adolescent_girl_sad_0001.jpg/200px-Adolescent_girl_sad_0001.jpg"

I think I need to add an OR operation to the regex above to allow image URLs starting with //.

The documentation says the | (pipe) performs an OR operation, but how do I add it to the regex above?

user2129623
  • You already have used it successfully in the `(?:png|jpg)` part, so why not do it again? – Bergi Dec 14 '13 at 14:02
  • BTW, it would be easier to make `https?` [as a whole](http://www.regular-expressions.info/brackets.html) [optional](http://www.regular-expressions.info/optional.html) than to use some [alternatives (pipe)](http://www.regular-expressions.info/alternation.html). – Bergi Dec 14 '13 at 14:04
  • Are you looking for image links in Wikipedia pages? For those, there even is a special API: https://www.mediawiki.org/wiki/API:Properties#images_.2F_im – Bergi Dec 14 '13 at 14:06
  • @Bergi: I already tried this: `if(preg_match_all('/\bhttps?:|//\/\/\S+(?:png|jpg)\b/', $html, $matches))`, which gives the error `Warning: preg_match_all(): Unknown modifier '/' in F:\wamp\www\img.php on line 10` – user2129623 Dec 14 '13 at 14:08
  • Is it OK to just parse out the "src" value, using '/src=([\'|"])(.+?)\1/'? – Andrew Dec 14 '13 at 14:09
  • @Andrew: That is cool, but I only want to parse the src of images – user2129623 Dec 14 '13 at 14:11
  • If you only want png|jpg, then '/<img[^>]+src=([\'"])([^>\'"]+?\.(?:png|jpg))\1/i' – Andrew Dec 14 '13 at 14:15
  • @Programming_crazy: What you were looking for is `'/\b(https?:\/\/|\/\/)\S+…` – Bergi Dec 14 '13 at 14:19
  • @Andrew: Is `$matches[0][0];` OK to get the result? It gives nothing – user2129623 Dec 14 '13 at 14:19
  • @Andrew [THE PONY HE COMES](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Niet the Dark Absol Dec 14 '13 at 14:19
  • Empty? See this: http://ideone.com/sHnRWx. -> string(6) "xx.jpg" – Andrew Dec 14 '13 at 14:24
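
For reference, the attempt quoted in the comments fails with "Unknown modifier '/'" because the unescaped `/` after the pipe ends the pattern, so the next `/` is read as a modifier. A minimal sketch of the "make the scheme optional" suggestion follows; the `#` delimiter, the quote-excluding character class, the `i` flag, and the sample `$html` are assumptions for illustration, not part of the original post. Like any regex approach, it can still be fooled by unusual markup.

```php
<?php
// Hypothetical sample input; the second URL is the protocol-relative one from the question.
$html = '<img src="http://example.com/a.png"> '
      . '<img src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Adolescent_girl_sad_0001.jpg/200px-Adolescent_girl_sad_0001.jpg">';

// "#" is used as the delimiter so that "/" needs no escaping.
// (?:https?:)?   optional "http:" or "https:" scheme
// //             the two slashes that every matched URL must contain
// [^\s"\']+      URL body, stopping at whitespace or a quote
// \.(?:png|jpg)  required extension, followed by a word boundary
$pattern = '#(?:https?:)?//[^\s"\']+\.(?:png|jpg)\b#i';

preg_match_all($pattern, $html, $matches);
$urls = $matches[0]; // full matches, one entry per URL found

print_r($urls);
```

With a single group and no capturing parentheses, `$matches[0]` holds the complete URLs, so `$matches[0][0]` would be the first one found.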

1 Answer

You could just avoid the wrath of the pony instead...

$dom = new DOMDocument();
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
$sources = array();
// Collect the src attribute of every <img> element
foreach ($images as $img) $sources[] = $img->getAttribute("src");

Done!

Niet the Dark Absol
  • But that might match non-png/jpg images as well… – Bergi Dec 14 '13 at 14:05
  • @Niet: What if I want to get only one image? – user2129623 Dec 14 '13 at 14:05
  • @Bergi It is trivial to add a simple "if" to check the extension. A more valid comment would have been that I put `href` instead of `src`. – Niet the Dark Absol Dec 14 '13 at 14:18
  • @Programming_crazy Depends which image you want. If it's the first, `$images->item(0)` is it. If it's an arbitrary one, change the number. In all cases, you can use the above code and then access the `$sources` array as needed. – Niet the Dark Absol Dec 14 '13 at 14:18
  • @NiettheDarkAbsol: When I returned `$images->item(0);` and echoed it, it gave the error `Catchable fatal error: Object of class DOMElement could not be converted to string in` – user2129623 Dec 14 '13 at 14:24
  • That's because it's a DOMElement. That's like trying to do `alert(document.getElementsByTagName('img')[0])` in JavaScript. – Niet the Dark Absol Dec 14 '13 at 14:26
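
Putting the comments above together, here is a rough sketch of the DOM approach with the extension check added and the first image read as a string. The sample `$html`, the `$first` variable, and the use of `parse_url`/`pathinfo` for the extension test are illustrative assumptions; the point is that `getAttribute('src')` returns a string, whereas `$images->item(0)` is a `DOMElement` and cannot be echoed directly.

```php
<?php
// Hypothetical sample markup: one protocol-relative jpg, one gif, one upper-case PNG.
$html = '<p><img src="//example.org/photo.jpg">'
      . '<img src="/icons/logo.gif">'
      . '<img src="http://example.org/chart.PNG"></p>';

$dom = new DOMDocument();
// Real-world HTML is often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$sources = array();
foreach ($dom->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src'); // a plain string, safe to echo

    // Look only at the path part so a query string does not hide the extension.
    $path = parse_url($src, PHP_URL_PATH);
    $ext  = strtolower(pathinfo((string) $path, PATHINFO_EXTENSION));

    // The simple "if" on the extension: keep png and jpg only.
    if (in_array($ext, array('png', 'jpg'), true)) {
        $sources[] = $src;
    }
}

// First matching image, or null when none was found.
$first = isset($sources[0]) ? $sources[0] : null;
echo $first, "\n";
```

To read the first `<img>` regardless of extension, `$images->item(0)->getAttribute('src')` gives the same kind of string, provided the page contains at least one image.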