Extracting specific URLs out of the document

Question

I think this should be elementary, but I still can't get my head around it. Let's say there's a fair number of HTML documents and I need to extract every image URL from them.

The rest of the content changes, but the base of the URL is always the same, for example: http://images.examplesite.com/images/

So I want to extract every string that contains that part. The problem is that they're always mixed in with <a href=''> or <img src=''> tags, so how could I strip those out? preg_match, probably?

possible duplicate of [PHP Xpath : get all href values that contain needle](http://stackoverflow.com/questions/2392393/php-xpath-get-all-href-values-that-contain-needle) — Gordon, Jul 20 '10 at 07:42
You can also use DOM as shown in [Preg_Match All A href](http://stackoverflow.com/questions/1519696/preg-match-all-a-href/1519791#1519791). Just change the XPath to the one given in the linked duplicate. — Gordon, Jul 20 '10 at 07:48

score 1 · Accepted Answer · answered Jul 20 '10 at 07:40

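Try something like:

  preg_match_all('/http:\/\/images\.examplesite\.com\/images\/(.*?)"/i', $html_data, $results, PREG_SET_ORDER);

With PREG_SET_ORDER, each $results[n][0] is the full match (including the closing quote) and $results[n][1] is the part after the base URL.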
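For what it's worth, a variant that returns the whole URL without the trailing quote, and that also copes with single-quoted attributes like the ones in the question, might look like the sketch below. The sample `$html_data` string and the `$urls` variable are made up for illustration, and the character class assumes the URLs contain no quotes, whitespace, or `>`.

```php
<?php
// Hypothetical sample input; in practice $html_data would hold the document.
$html_data = '<a href="http://images.examplesite.com/images/a.jpg">x</a>'
           . "<img src='http://images.examplesite.com/images/b.png'>";

// Match the fixed base, then everything up to the next quote, whitespace, or '>'.
// Using ~ as the delimiter avoids having to escape every slash.
$pattern = '~http://images\.examplesite\.com/images/[^"\'\s>]*~i';

preg_match_all($pattern, $html_data, $matches);

// $matches[0] holds the full matches; array_unique drops repeated URLs.
$urls = array_values(array_unique($matches[0]));

print_r($urls);
```

Because nothing after the URL is part of the pattern, there is no stray quote to trim afterwards.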

Narcis Radu

wow, that was fast. it leaves one " after the string, but believe it or not I got rid of it myself ;D thanks again! – Seerumi Jul 20 '10 at 07:57

score 0 · Answer 2 · answered Jul 20 '10 at 07:43

You can either use an HTML DOM parser or use a regular expression:

  preg_match_all("/http:\/\/images\.examplesite\.com\/images\/(.*?)\"/s", $str, $preg);
  print_r($preg);
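The DOM parser route could be sketched roughly as follows, using PHP's bundled DOMDocument and DOMXPath classes (this assumes the dom extension is enabled). The sample `$html` string and the `$base` and `$urls` variables are made up for illustration; the XPath expression simply selects every href or src attribute whose value contains the base URL.

```php
<?php
// Hypothetical sample input standing in for one of the HTML documents.
$html = '<html><body>'
      . '<a href="http://images.examplesite.com/images/a.jpg">link</a>'
      . '<img src="http://images.examplesite.com/images/b.png">'
      . '<img src="http://other.example.org/c.gif">'
      . '</body></html>';

$base = 'http://images.examplesite.com/images/';

$dom = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select every href or src attribute whose value contains the base URL.
$nodes = $xpath->query("//@href[contains(., '$base')] | //@src[contains(., '$base')]");

$urls = [];
foreach ($nodes as $attr) {
    $urls[] = $attr->nodeValue;
}

print_r($urls);
```

Unlike the regex, this only looks at attribute values, so a URL that appears in plain text or in a different attribute would not be picked up unless the XPath is extended.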

ahmetunal
