1

I have the body of a post in which the user may or may not have inserted images. I need to retrieve each occurrence of an image in the post. This is the pattern:

<img src="/storage/USER_ID/articles/pictures/FILENAME">

So let's say I have this body:

$body = '... Cras ut tristique est. Etiam porttitor elit velit, vitae consequat eros interdum ac. Nam in blandit ante.</p><p>&nbsp;</p><figure class="image"><img src="/storage/5/articles/pictures/1560534410321_a363bc0d804aec432567128ed10416ee.jpeg"></figure><p>Integer sed justo accumsan, consequat nulla at, tincidunt massa. Integer orna Etiam porttitor elit velit, vitae consequat eros interdum ac. Nam in blandit ante.</p><p>&nbsp;</p><figure class="image"><img src="/storage/5/articles/pictures/23456410321_a33456t604aec432567128ed10416ee.jpeg"></figure> j hgfjhf  jfhfj hgf jh786 876 8 76fgj tfyt u  ufgi uyu y gi iy gygg ...';

I want to retrieve the number 5 and the filename 1560534410321_a363bc0d804aec432567128ed10416ee.jpeg

and number 5 and the filename 23456410321_a33456t604aec432567128ed10416ee.jpeg

So in this scenario I think the pattern should be like this: retrieve any number and filename between <img src="/storage/ number /articles/pictures/ filename ">

This is what I have so far:

preg_match_all ('/<img src=\"\/storage\/(.*?)\/articles\/pictures\/(.*?)\.(.*?)\"\>/g', $body , $result);

How can I improve this regex so that it also handles the case where " is replaced by '?

IgorAlves
  • Switch to HTML DOM parsing, it looks like the best moment. See [How do you parse and process HTML/XML in PHP?](https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) – Wiktor Stribiżew Jun 14 '19 at 20:59

5 Answers

1

Parse the HTML with a DOM parser first, then apply a regex to the extracted values.

DOMDocument is a built-in library that is easy to set up.

You can use it to get the string value of each image's src attribute:

<?php

// Create a DOM object from a string
$dom = new DOMDocument;
$dom->loadHTML($string);

// Collect every <img> element in the document
$images = $dom->getElementsByTagName('img');

// Loop through all the images and print their 'src' attribute
foreach ($images as $image) {
    echo $image->getAttribute('src');
}

?>

While this library is fairly limited, it will keep your code base small and relatively efficient without having to download anything. :)

After parsing the HTML, a regex is one of many ways to pull the desired information out of the file path.

The following regex splits the path into its USER_ID and FILENAME parts.

<?php

$string = "/storage/5/articles/pictures/1560534410321_a363bc0d804aec432567128ed10416ee.jpeg";

// Run the regex: preg_match() needs pattern delimiters and fills $matches by reference
preg_match('#/storage/(\d+)/articles/pictures/((?:[\S\s])*)#', $string, $matches);

$user_id = $matches[1];
$filename = $matches[2];

?>
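The two steps above (DOM parsing, then a regex on each src value) can be combined into one helper. This is only a sketch: the function name `extractStorageImages`, the returned array shape, and the short sample filenames are assumptions made for illustration, not part of the original post.

```php
<?php
// Hypothetical helper: name and return shape are assumptions for illustration.
function extractStorageImages(string $html): array
{
    // Collect libxml warnings internally (HTML5 tags such as <figure> trigger them).
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($dom->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');
        // Named groups keep the result readable; the anchors reject other paths.
        if (preg_match('#^/storage/(?<user_id>\d+)/articles/pictures/(?<filename>[^/]+)$#', $src, $m)) {
            $found[] = ['user_id' => $m['user_id'], 'filename' => $m['filename']];
        }
    }
    return $found;
}

// Sample input with double-quoted, single-quoted, and unrelated src values.
$body = '<p>Text</p><figure class="image"><img src="/storage/5/articles/pictures/a.jpeg"></figure>'
      . "<p>More</p><img src='/storage/7/articles/pictures/b.png'><img src=\"/other/x.gif\">";

print_r(extractStorageImages($body));
```

Because the parser reads the attribute, the quote style around src does not matter here; only the two /storage/ images should end up in the result.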
Community
shn
  • Not a DOM, but a special (and limited) library. – ThW Jun 14 '19 at 21:06
  • That library is not recommended. Its codebase is more than a decade old. It's outdated and fragile. Much better alternatives exist, especially PHP's built-in DomDocument or SimpleXML. – miken32 Jun 14 '19 at 22:07
  • Thanks for letting me know. @miken32 – shn Jun 14 '19 at 22:31
1

Avoid parsing HTML with regex.

It's better to first narrow things down to the values you need with a DOM parser, then run a regex on those values if necessary.

<?php
$body = '...';

$dom_err = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHtml($body, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

$imgs = [];
foreach ($xpath->query("//figure/img") as $img) {
    $src = $img->getAttribute('src');

    if (preg_match('#/storage/(.*)/articles/pictures/(.*)#', $src, $result)) {
        $imgs[] = [
            'id' => $result[1],
            'name' => $result[2]
        ];
    }
}

libxml_clear_errors();
libxml_use_internal_errors($dom_err);

print_r($imgs);

Result:

Array
(
    [0] => Array
        (
            [id] => 5
            [name] => 1560534410321_a363bc0d804aec432567128ed10416ee.jpeg
        )

    [1] => Array
        (
            [id] => 5
            [name] => 23456410321_a33456t604aec432567128ed10416ee.jpeg
        )

)


Emma
Lawrence Cherone
-1

Here, we can use a simple expression with preg_match_all:

src=".*?([^\/]+\.[a-z]+)?"

The filename (the last path segment, including its extension) is captured by this group:

([^\/]+\.[a-z]+)

Test

$re = '/src=".*?([^\/]+\.[a-z]+)?"/m';
$str = '... Cras ut tristique est. Etiam porttitor elit velit, vitae consequat eros interdum ac. Nam in blandit ante.</p><p>&nbsp;</p><figure class="image"><img src="/storage/5/articles/pictures/1560534410321_a363bc0d804aec432567128ed10416ee.jpeg"></figure><p>Integer sed justo accumsan, consequat nulla at, tincidunt massa. Integer orna Etiam porttitor elit velit, vitae consequat eros interdum ac. Nam in blandit ante.</p><p>&nbsp;</p><figure class="image"><img src="/storage/5/articles/pictures/23456410321_a33456t604aec432567128ed10416ee.jpeg"></figure> j hgfjhf  jfhfj hgf jh786 876 8 76fgj tfyt u  ufgi uyu y gi iy gygg ...';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

foreach ($matches as $key => $value) {
    echo $value[1] . "\n";
}

Output

1560534410321_a363bc0d804aec432567128ed10416ee.jpeg
23456410321_a33456t604aec432567128ed10416ee.jpeg
Emma
-1

This works

<img(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\ssrc\s*=\s*(?:(['"])(?:(?!\1)[\S\s])*?/storage/(\d+)/articles/pictures/((?:(?!\1)[\S\s])*)\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

The Number is in group 2, the Filename is in group 3.

https://regex101.com/r/4oSMXl/1

Explained

 # Begin open img tag

 < img
 (?= \s )
 (?=                    # Assertion (a pseudo atomic group)
      (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
      \s src \s* = \s*       # Src Attribute
      (?:
           ( ['"] )               # (1), Quote
           (?:                    # Src Value
                (?! \1 )
                [\S\s] 
           )*?

           /storage/
           ( \d+ )                # (2), Number
           /articles/pictures/

           (                      # (3 start), Filename, general to end of string
                (?:
                     (?! \1 )
                     [\S\s] 
                )*
           )                      # (3 end)
           \1                     # End Quote
      )
 )
                        # Have the code, just match the rest of tag
 \s+ 
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+

 >                      # End img tag
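For reference, a minimal sketch of how the compact pattern above might be used from PHP. The nowdoc and the `~` delimiter are choices made here to avoid escaping quotes and slashes, and the sample tags and filenames are hypothetical.

```php
<?php
// Nowdoc keeps the pattern free of PHP string escaping.
// "~" is used as the delimiter because the pattern itself contains "/".
$pattern = <<<'REGEX'
~<img(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\ssrc\s*=\s*(?:(['"])(?:(?!\1)[\S\s])*?/storage/(\d+)/articles/pictures/((?:(?!\1)[\S\s])*)\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>~
REGEX;

// Hypothetical sample: one double-quoted and one single-quoted src.
$body = '<img src="/storage/5/articles/pictures/a.jpeg">'
      . "<img class='x' src='/storage/7/articles/pictures/b.png'>";

preg_match_all($pattern, $body, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    // Group 2 is the number, group 3 is the filename.
    echo "user {$match[2]}, file {$match[3]}\n";
}
```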
-1

Here are two points:

If you are extracting information from HTML/XML, use the matching parser. Most of the time that means DOM. You can use XPath expressions to fetch nodes. This is somewhat limited because PHP only supports XPath 1.0, which offers only simple string functions. However, you can get around that limit by registering PHP functions and calling them from XPath.

$html = <<<'HTML'
<img src="/storage/USER_ID/articles/pictures/FILENAME">
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXpath($document);

$expression = '//img[starts-with(@src, "/storage/")]';

foreach ($xpath->evaluate($expression) as $imageNode) {
    var_dump($imageNode->getAttribute('src'));
}

Output:

string(43) "/storage/USER_ID/articles/pictures/FILENAME"

This is the better way. The parser will take care of format specifics like the quotes or decoding entities.
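As a sketch of the "registering PHP functions" idea mentioned above: `DOMXPath::registerPhpFunctions()` can expose `preg_match` to the XPath expression, so the path check happens inside the query. The sample markup, filenames, and the exact regex are assumptions for illustration.

```php
<?php
// Hypothetical sample: mixed quote styles, an entity in a filename, and an unrelated path.
$html = <<<'HTML'
<p><img src="/storage/5/articles/pictures/a.jpeg"></p>
<p><img src='/storage/7/articles/pictures/b&amp;c.png'></p>
<p><img src="/storage/avatars/d.gif"></p>
HTML;

$previous = libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($document);
// The "php" prefix must be bound to this namespace before calling PHP functions.
$xpath->registerNamespace('php', 'http://php.net/xpath');
// Expose only preg_match to XPath rather than every PHP function.
$xpath->registerPhpFunctions('preg_match');

// Keep only <img> nodes whose src matches the storage path layout.
$expression = '//img[php:functionString("preg_match", "#^/storage/\d+/articles/pictures/[^/]+$#", @src) = 1]';

$sources = [];
foreach ($xpath->evaluate($expression) as $imageNode) {
    $sources[] = $imageNode->getAttribute('src');
}

var_dump($sources);
```

The single-quoted attribute and the `&amp;` entity are both handled by the parser before the regex ever sees the value.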

However, if you really want or need to use a regex: matching alternative characters in a PCRE pattern is easy. Use a character class like (?<quote>["']) or an alternation like (?<quote>"|') wrapped in a named group. You can then reference that group to match the closing quote. Here is a condensed example:

$pattern = '((?<quote>[\'"])(?<content>.*?)\g{quote})';
$subject = <<<'DATA'
'foo' "bar"
DATA;

preg_match_all($pattern, $subject, $matches);
var_dump($matches['content']);

Output:

array(2) { 
  [0]=> 
  string(3) "foo" 
  [1]=> 
  string(3) "bar" 
}
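Applied to the image pattern from the question, the named quote group might look like the sketch below. The group names `user_id` and `filename`, the `~` delimiter, and the sample markup are assumptions. It also assumes the src value starts directly with /storage/.

```php
<?php
// (?<quote>["']) captures the opening quote; \g{quote} requires the same closing quote.
$pattern = '~<img\b[^>]*?\ssrc\s*=\s*(?<quote>["\'])/storage/(?<user_id>\d+)/articles/pictures/(?<filename>[^"\'/]+)\g{quote}~i';

// Hypothetical sample with one double-quoted and one single-quoted src.
$body = '<figure class="image"><img src="/storage/5/articles/pictures/a.jpeg"></figure>'
      . "<img alt='x' src='/storage/7/articles/pictures/b.png'>";

preg_match_all($pattern, $body, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match['user_id'] . ' ' . $match['filename'] . "\n";
}
```

This stays a plain regex, so it still inherits the usual fragility of matching HTML with patterns; the DOM approach above remains the safer route.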
Emma
ThW