
I'm trying to extract all the hrefs and srcs from a string like this:

$content = "
At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium
voluptatum deleniti Image: <img src = 'http://example.com/check-3.png' /> Link: <a href ='http://example.com/test.xls'>test.xls</a>";

Basically, what I want to do is change example.com to a different domain name (say test.com) and then extract all the filenames from the hrefs and srcs. I was able to do the domain name replacement with a simple str_replace, but now I'm stuck trying to extract the hrefs and srcs.

Here's what I tried:

$regex = "/src=[\"' ]?([^\"' >]+)[\"' ]?[^>]*>.*?href=[\"' ]?([^\"' >]+)[\"' ]?[^>]*>/i";

This seems to work if there is no space between src (or href) and the = (e.g. `src=`), but it does not work if there is a space (e.g. `src =`). I've tried adding the space character, but then the preg match fails. I don't want to use a heavy library like Simple HTML DOM; besides, I don't think it would work, as this is not a proper HTML document. It's a string coming out of CKEditor.

Ashesh
    "If I had a coin each time anybody tried to parse HTML with regexes..." - I advise you to go with `DOMDocument` and `XPath` - see http://stackoverflow.com/questions/1933631/how-do-i-parse-partial-html. – moonwave99 Aug 29 '12 at 16:50

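For comparison, the DOM-based route suggested in the comment above could look roughly like the sketch below. It assumes the PHP DOM extension is available and that the CKEditor output is close enough to HTML for libxml to recover from; the shortened sample string and the variable names are illustrative only, not from the original post.

```php
<?php
// Shortened version of the sample string from the question.
$content = "voluptatum deleniti Image: <img src = 'http://example.com/check-3.png' /> "
         . "Link: <a href ='http://example.com/test.xls'>test.xls</a>";

// Collect parser warnings internally instead of printing them,
// since the input is a fragment rather than a full HTML document.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

// Select every src and href attribute node, in document order.
$xpath = new DOMXPath($doc);
$urls = [];
foreach ($xpath->query('//@src | //@href') as $attr) {
    $urls[] = $attr->nodeValue;
}

print_r($urls);
```

The HTML parser tolerates whitespace around the `=` on its own, so the spacing problem from the question does not come up with this approach.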
1 Answer

Why not just add quantifiers on the space?

$regex = "/src *= *[\"' ]?([^\"' >]+)[\"' ]?[^>]*>.*?href *= *[\"' ]?([^\"' >]+)[\"' ]?[^>]*>/i";
               ^  ^                                       ^  ^
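As a rough sketch of how the optional-space idea could be used to pull out the filenames the question asks for, the snippet below allows spaces on both sides of `=` for either attribute and runs it with `preg_match_all`. The combined `(src|href)` pattern, the shortened sample string, and the use of `parse_url` plus `basename` are assumptions made for illustration, not part of the original answer.

```php
<?php
// Shortened version of the sample string from the question.
$content = "voluptatum deleniti Image: <img src = 'http://example.com/check-3.png' /> "
         . "Link: <a href ='http://example.com/test.xls'>test.xls</a>";

// " *" allows any number of spaces on either side of "=".
// (src|href) lets one pattern match both attributes, in any order.
// \b keeps the pattern from matching inside a longer attribute name.
$regex = "/\b(src|href) *= *[\"']?([^\"' >]+)/i";

preg_match_all($regex, $content, $matches, PREG_SET_ORDER);

$files = [];
foreach ($matches as $m) {
    // $m[1] is the attribute name, $m[2] is the URL.
    $files[] = basename(parse_url($m[2], PHP_URL_PATH));
}

print_r($files); // check-3.png, test.xls
```

Matching each attribute separately avoids depending on a `src` always being followed by an `href`, which the single combined pattern requires.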
Andrew Cheong
  • Why is there a space after the `=`? Shouldn't it be `/src*=*`, meaning any number of spaces before and after `=`? – Ashesh Aug 29 '12 at 17:00
  • The `*` modifies the previous character. `src *= *` means: "'src' followed by any number of spaces, followed by '=', followed by any number of spaces". `src*=*` means: "'sr' followed by any number of 'c's, followed by any number of '='s". – gen_Eric Aug 29 '12 at 17:05
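
The difference between the two patterns discussed in these comments can be checked directly. This is a small illustrative test; the sample strings are made up for the demonstration, and each pattern is anchored and followed by a quote so the comparison is unambiguous.

```php
<?php
// Made-up inputs: no spaces, spaces around "=", and repeated "c" and "=".
$samples = ["src='a.png'", "src = 'a.png'", "srccc=='a.png'"];

$results = [];
foreach ($samples as $s) {
    $results[$s] = [
        // "src", any number of spaces, "=", any number of spaces, then a quote.
        'spaced'   => preg_match("/^src *= *'/", $s),
        // "sr", any number of "c", any number of "=", then a quote.
        'unspaced' => preg_match("/^src*=*'/", $s),
    ];
}

print_r($results);
```

Only the spaced pattern accepts `src = 'a.png'`, while the unspaced one accepts the malformed `srccc=='a.png'` instead.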