I have an img tag in my text and I want to get the file name from its src attribute.

So I use this code:

preg_match_all("|\/img\/(.*)\/>|U", $article_header, $matches, PREG_PATTERN_ORDER);
echo "match=".$matches[1][0]."<br/>";

This gives me the following result:

match=500.JPG\" alt=\"\" width=\"500\" height=\"360\"

In this case the pattern ends with "\/>", which matches the end of the tag.

But I want only the file name, "500.JPG", so the pattern has to stop at the "\" instead. When I try that

    preg_match_all("|\/img\/(.*)\\|U", $article_header, $matches, PREG_PATTERN_ORDER);

I get no matches :( Please help.

With the help of yes123 I did this:

$doc = new DOMDocument();
$doc->loadHTML($article_header);

$imgs = $doc->getElementsByTagName('img');
$img_src = array();
foreach ($imgs as $img) {
// Store the img src
$img_src[] = $img->getAttribute('src');
echo $img_src[0];
}

which gives me this

\"sources/public/users/qqqqqq/articles/2011-06-11/7/img/500.JPG\"

But I still want only 500.JPG from this.

So what is the right regexp?

David

5 Answers

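To match a literal backslash in a regex, you have to double-escape it: the PHP string literal needs four backslashes (\\\\) to produce the two-character sequence \\ that the regex engine reads as a single backslash. With only two, the regex engine receives one backslash, which escapes the closing | delimiter, so the pattern never terminates.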

preg_match_all("|/img/(.*)\\\\|U", ...);
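
As a rough, self-contained check of the four-backslash pattern, the sketch below runs it against a made-up $article_header. The sample string is an assumption, modeled on the output shown in the question, where each quote is preceded by a backslash.

```php
<?php
// Hypothetical input: single quotes keep every \" as a literal backslash plus quote.
$article_header = '<img src=\"sources/public/users/qqqqqq/articles/2011-06-11/7/img/500.JPG\" alt=\"\" />';

// "\\\\" in PHP source is the two-character string \\, which PCRE reads as one literal backslash.
// The U modifier makes (.*) ungreedy, so the capture stops at the first backslash after /img/.
preg_match_all("|/img/(.*)\\\\|U", $article_header, $matches, PREG_PATTERN_ORDER);

$first_match = $matches[1][0];
echo "match=" . $first_match . "\n"; // prints match=500.JPG
```

This only works while the backslash is actually present in the data; if the escaping is removed upstream, the pattern stops matching.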
Floern

Use PHP's pathinfo() function:

http://php.net/manual/en/function.pathinfo.php

pathinfo($img_src[0]);

Result (once the surrounding \" characters have been removed from the value):

Array
(
    [dirname] => sources/public/users/qqqqqq/articles/2011-06-11/7/img
    [basename] => 500.JPG
    [extension] => JPG
    [filename] => 500
)
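
Since the value in the question still carries the stray \" at both ends, a minimal sketch could trim those first and then ask pathinfo() for the basename only. The variable names here are illustrative, not from the original code.

```php
<?php
// Value as reported in the question, including the leading and trailing \" characters.
$raw_src = '\"sources/public/users/qqqqqq/articles/2011-06-11/7/img/500.JPG\"';

// Strip backslashes and double quotes from both ends of the string.
$clean_src = trim($raw_src, '\\"');

// PATHINFO_BASENAME makes pathinfo() return just the file name as a string.
$file_name = pathinfo($clean_src, PATHINFO_BASENAME);

echo $file_name, "\n"; // prints 500.JPG
```

basename($clean_src) would give the same result here; pathinfo() is mainly useful when the extension or directory is needed as well.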
merakli

You can't parse HTML with regex.

Use DOMDocument

// HTML already parsed into $dom
$imgs = $dom->getElementsByTagName('img');
$img_src = array();
foreach ($imgs as $img) {
  // Store the img src
  $img_src[] = $img->getAttribute('src');

}
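
For completeness, here is a sketch of the whole flow, from the raw string to the bare file name. It assumes the input looks like the sample in the question (quotes escaped with backslashes), so it calls stripslashes() before parsing; the sample string and the $file_names variable are made up for illustration.

```php
<?php
// Hypothetical input modeled on the question: every quote is preceded by a backslash.
$article_header = 'foo <img src=\"sources/public/users/qqqqqq/articles/2011-06-11/7/img/500.JPG\" alt=\"\" /> foo';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// stripslashes() removes the backslashes so the attribute values are quoted normally.
$dom->loadHTML(stripslashes($article_header));

$file_names = array();
foreach ($dom->getElementsByTagName('img') as $img) {
    // basename() keeps only the part after the last slash of the src path.
    $file_names[] = basename($img->getAttribute('src'));
}

print_r($file_names);
```

With this approach no regex is needed at all: the parser deals with quoting and attribute order, and basename() does the path handling.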

Don't forget that you can always search Google or Stack Overflow before opening a question.

dynamic
  • Can you write an example for my question? – David Jun 11 '11 at 15:05
  • He's not asking to parse a full HTML document tree, or any type of nested structure. If you just need to extract a src attribute from an image tag, regex will work fine. – Evert Jun 11 '11 at 15:16
  • @Evert I guess you didn't read this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 This applies whether it's just a DOMElement or a whole HTML page. What about a src with single quotes, or without any quotes at all? Or with spaces before and after the equals sign? Write a regex for that and I will write an img tag your regex will not catch. – dynamic Jun 11 '11 at 15:18
  • maybe I'm too noob but this code crashes $imgs = $article_header->getElementsByTagName('img'); $img_src = array(); foreach ($imgs as $img) { // Store the img src $img_src[] = $img->getAttribute('src'); } – David Jun 11 '11 at 15:22
  • @david You need to instantiate the DOMDocument: http://it.php.net/manual/en/domdocument.loadhtml.php – dynamic Jun 11 '11 at 15:23
  • guys we are PHP right ? not javascript. U sure those getElementsByTagName and getAttribute are working ? My dreamviewer recognises them as functions, and I think they must be declared somewhere before use , am I wrong ? Don't know why but your code yes123 is not working :( – David Jun 11 '11 at 15:26
  • ok thank you yes 123 I think I will try. I will get src by your method but will apply preg_match to the source later to get only the name of the file which I need. without the whole source. – David Jun 11 '11 at 15:28
  • OK I've done as you've said but ended with this > \"sources/public/users/qqqqqq/articles/2011-06-11/7/img/500.JPG\" I understand that u're right and it is good to parse html dom first to get the correct src but now I want only 500.JPG from this. So how can I regexp it ? – David Jun 11 '11 at 15:34
  • @yes123: I know the other answer, and also know why this message is so famously repeated. The main point of the answer is that standard regex is not Turing complete, and therefore cannot be recursive. It's definitely possible to extract a simple, single string. It's easy to match on quotes. Sure, there might be extreme edge cases a simple regex cannot parse, but that doesn't mean most standard cases won't parse fine. – Evert Jun 12 '11 at 15:59
preg_match_all('/<img[^>]*\ssrc="([^"]+)".*>/Uis', $article_header, $matches);
d4rky

Try something like this; I just tested it:

$article_header = 'foo <img src=\\"sources/public/users/qqqqqq/articles/2011-06-11/7/img/500.JPG\\" /> foo';
preg_match_all('|<img[^>]+?src="[^"]*?([^/"]+?)"|', stripslashes($article_header), $matches, PREG_PATTERN_ORDER);
echo "match=".$matches[1][0]."<br/>";

It seems that your $article_header contains backslashes before the quotes (that was a bit confusing), so I added a stripslashes() call.

flori
  • Sorry, I didn't realize that your data contains backslashes and that you need just the file name (not the path). As you can see, I improved my answer. – flori Jun 11 '11 at 15:57