Recently I had to extract all img tags from a given HTML string in PHP. After some time googling around, I was able to solve the problem with the following statement:

preg_match_all('/<img(.*?)\/>/s', $content, $images);

Though I have a rough idea of how regular expressions work in PHP, I wasn't able to figure out why (.*?) can be used as a placeholder between certain strings (in this case '<img' and '/>').

So can anyone give me a proper explanation of the regular expression (.*?)?

magic_al

3 Answers

5

. matches any character (with the s modifier used here, that includes line breaks). * tells the regex engine to match any number of those characters. ? in this context makes the quantifier lazy, which means "match as few characters as possible while still letting the rest of the pattern match". (For a more precise description, see this excellent answer.)

In effect, /<img(.*?)\/>/ means "start with matching <img, then continue matching any character until the first /> is found".
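
To make the difference concrete, here is a minimal sketch comparing the lazy pattern from the question with a greedy variant. The sample string in $content is a made-up input for illustration, not taken from the question.

```php
<?php
// Hypothetical sample input (an assumption for this sketch).
$content = '<p><img src="a.png" /> text <img src="b.png" /></p>';

// Lazy: .*? grows one character at a time and stops at the first "/>".
preg_match_all('/<img(.*?)\/>/s', $content, $lazy);

// Greedy: .* first runs to the end of the string, then backtracks
// to the last "/>" it can find.
preg_match_all('/<img(.*)\/>/s', $content, $greedy);

print_r($lazy[0]);   // two separate matches, one per tag
print_r($greedy[0]); // a single match spanning both tags
```

With the lazy version each img tag comes back as its own match; the greedy version swallows everything from the first <img to the last />.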

lonesomeday
  • Ok got it. The explanation of '?' did it. Thanks a lot. – magic_al Feb 10 '14 at 12:01
  • "Make the match as small as possible" is a bit misleading: `A.*?C` will match the entire string `"ABBBBBBABC"` and not just `"ABC"`. – Tim Pietzcker Feb 10 '14 at 12:02
  • @TimPietzcker Fair point. Linked to a question that specifically addresses this point. – lonesomeday Feb 10 '14 at 12:06
  • The correct way of putting it is "as few times as possible". If that means it has to match the entire thing before getting to `C`, then so be it! – Vasili Syrakis Feb 10 '14 at 12:15
  • @magic_al If this (or any other) answer resolved your issue, you should consider accepting it. This indicates to other users that the issue has been resolved and helps future visitors know that the answer may be useful to them as well. – p.s.w.g Mar 13 '14 at 18:50
1

The group (.*?) can be explained as follows:

(     // Beginning of the capture group
 .    // Matches any single character (line breaks only with the s modifier)
 *    // Repeats the previous expression 0 to infinite times, equivalent to {0,}
 ?    // Placed after a quantifier, makes it lazy: repeat as few times as possible**
)     // End of the capture group

** On its own, ? means "0 or 1 times", but directly after another quantifier such as * it switches that quantifier from greedy to lazy. In effect, between <img and /> there may be any characters, 0 or more of them, and the match stops at the first /> that follows.

As a result, this regular expression matches various strings like <img/>, <img src="..." alt=""/> just to list a few examples.
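
As a rough sketch of what the group actually captures for those two example strings (the $samples array and variable names are assumptions made for this illustration):

```php
<?php
// Hypothetical inputs, taken from the two example strings above.
$samples = ['<img/>', '<img src="..." alt=""/>'];
$captures = [];

foreach ($samples as $html) {
    if (preg_match('/<img(.*?)\/>/s', $html, $m)) {
        // $m[0] is the whole match, $m[1] is what (.*?) captured.
        $captures[] = $m[1];
    }
}

var_dump($captures);
```

For <img/> the group captures an empty string, since zero repetitions are allowed; for the second string it captures everything between <img and />, including the leading space.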

http://www.php.net/manual/en/reference.pcre.pattern.syntax.php This link provides a good manual on regular expressions in PHP if you want to read up on the topic.

1
  • . means "any character that is not a line break" (unless the s modifier is set, as in the question's pattern, in which case line breaks match too)
  • * is a 'quantifier' which means "between 0 and unlimited times, as many times as possible"
  • ? is a lazy quantifier, which turns the * into "as few times as possible"
  • The () wrapped around the expression turns it into a "capture group" which can be referred to using a backreference \1 or $1 for further processing.

The full expression (.*?) means:

Match the following into a capture group with backreference 1:
Match any character that is not a line break, between 0 and unlimited times, as few times as possible, expanding as needed.
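
Two small checks may help here; both are sketches with made-up inputs. The first reuses the `A.*?C` string from the comments to show that "as few times as possible" is counted from the leftmost possible starting point. The second shows how the s modifier changes what . matches, assuming PCRE's usual default of treating \n as the line break.

```php
<?php
// The engine starts at the first "A" and lets .*? expand one character
// at a time until "C" can match, so the whole string is matched.
preg_match('/A.*?C/', 'ABBBBBBABC', $m);
echo $m[0], "\n"; // ABBBBBBABC, not just "ABC"

// Hypothetical tag with a line break inside it.
$text = "<img\nsrc=\"a.png\" />";

// Without s, the dot cannot cross the line break, so nothing matches.
$withoutS = preg_match('/<img(.*?)\/>/', $text);

// With s, the dot also matches line breaks and the tag is found.
$withS = preg_match('/<img(.*?)\/>/s', $text);

var_dump($withoutS, $withS); // int(0), int(1)
```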

Vasili Syrakis