Why is my preg_match_all statement capturing more than it should?

Question

I am cleaning and migrating content to a new website. In some of the existing pages there are embedded images that link to files in a non-standard folder.

I am pulling the records from the database and then doing a "preg_match_all" to capture the offending items. My intention is then to clean up the filename, move the offending file and then update the database entry to reflect the new location.

However, for some reason my regex statement seems to be finding only one match (of known multiple potential hits), and sometimes seems to capture a whole load of other stuff downstream of the string I want.

This is the expression pattern I am using:

(?i)(<img.*src="uploads/RTEmagicC_(.*)")/

This is an example of content from the database that I am matching against:

BLAH BLAH BLAH<img src="uploads/RTEmagicC_Herpes_simpex_virus.jpg.jpg" alt="HSV particles" style="FLOAT: left; WIDTH: 214px; HEIGHT: 198px" title="Electron micrograph of HSV particles©NASA">blah blah blah<img src="uploads/RTEmagicC_Herpes_labialis_01.jpg.jpg" alt="Coldsore" style="FLOAT: right;" title="Cold sore on the lower lip (cluster of fluid-filled blisters = very infectious). These infections may appear on the lips, nose or in surrounding areas.©Metju12" width="238" height="178">blah blah blah

I am trying to grab: "Herpes_simpex_virus.jpg.jpg" and "Herpes_labialis_01.jpg.jpg" and the respective full links e.g.: "img src="uploads/RTEmagicC_Herpes_simpex_virus.jpg.jpg"

But it's matching a heap of downstream stuff too, beyond the " that closes the filename.

Can someone please put me out of my misery? I've tried for a few evenings on this and clearly I'm doing something stupid, but I cannot see what...

Many thanks.

Don't use regular expressions to parse HTML, use an HTML parser like `DOMDocument`. — Barmar, Nov 09 '16 at 23:51
Thank you; but I must admit that I know nothing of how to do that or the rationale behind not using regular expressions. Can you explain, or supply me with a reference please? Thanks — Chris, Nov 10 '16 at 00:43
See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Barmar, Nov 10 '16 at 00:53
@Barmar Note the 2nd highest voted answer, OP seems to know quite well what he wants to receive from the content. — Sebastian Proske, Nov 10 '16 at 01:33

Sebastian Proske · Accepted Answer · 2016-11-09T23:37:39.417


By default, regex quantifiers are greedy, so the .* matches as much as possible, running past any intervening " until the last " it can find. The same is true of the .* you use after img, which is why you get only one match. You can use lazy matching, which matches as little as possible, by adding a ? to your quantifiers, so in your case this would be (?i)<img.*?src="uploads/RTEmagicC_(.*?)".
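
To make the difference concrete, here is a minimal sketch that runs the greedy and the lazy pattern against a shortened version of the sample content from the question. The shortened input, the `#` delimiters and the variable names are assumptions made for illustration; only the two patterns come from the discussion above.

```php
<?php
// Shortened stand-in for the database content in the question (assumed input).
$html = 'BLAH<img src="uploads/RTEmagicC_Herpes_simpex_virus.jpg.jpg" alt="HSV particles">blah'
      . '<img src="uploads/RTEmagicC_Herpes_labialis_01.jpg.jpg" alt="Coldsore" width="238">blah';

// Greedy: each .* first grabs everything up to the end of the string,
// then backtracks only as far as needed for the rest of the pattern to match.
$greedy = '#(?i)<img.*src="uploads/RTEmagicC_(.*)"#';

// Lazy: each .*? grows one character at a time and stops as soon as
// the rest of the pattern can match.
$lazy = '#(?i)<img.*?src="uploads/RTEmagicC_(.*?)"#';

$greedyCount = preg_match_all($greedy, $html, $greedyMatches);
$lazyCount   = preg_match_all($lazy, $html, $lazyMatches);

echo "greedy: $greedyCount match(es)\n";
print_r($greedyMatches[1]);

echo "lazy: $lazyCount match(es)\n";
print_r($lazyMatches[1]);
```

On this input the greedy pattern should report a single match whose captured group starts at the second filename and runs on to the last quote in the string, while the lazy pattern should report both filenames separately.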

For your test string you wouldn't even need a .*?; a simple \s+ (matching one or more whitespace characters) would be sufficient, though this might not be the case for all your data. You can also replace the second .*? with [^"]*, which matches any number of non-quote characters.
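
As a rough sketch of that variant, the following uses \s+ and [^"]* and keeps the outer capture group from the question, so that both the start of the tag and the bare filename are available. The shortened input, the `#` delimiters and the use of PREG_SET_ORDER are choices made here for illustration, not something the question prescribes.

```php
<?php
// Shortened stand-in for the database content in the question (assumed input).
$html = 'BLAH<img src="uploads/RTEmagicC_Herpes_simpex_virus.jpg.jpg" alt="HSV particles">blah'
      . '<img src="uploads/RTEmagicC_Herpes_labialis_01.jpg.jpg" alt="Coldsore" width="238">blah';

// Group 1: the tag from "<img" up to the quote that closes the src value.
// Group 2: the filename after the RTEmagicC_ prefix.
// [^"]* cannot run past the closing quote, so no lazy quantifier is needed.
$pattern = '#(?i)(<img\s+src="uploads/RTEmagicC_([^"]*)")#';

// PREG_SET_ORDER groups the results per match rather than per capture group.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo "tag start: {$m[1]}\n";
    echo "filename:  {$m[2]}\n";
}
```

Note that \s+ only works while src is the first attribute after img; if other attributes can come first, the lazy .*? form is the safer choice.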

edited Nov 09 '16 at 23:37

answered Nov 09 '16 at 23:15


Thank you so much; BUT, I'm now not capturing the filename downstream of the RTEmagicC_ bit. This is what comes out: 0 => ' ' – Chris Nov 09 '16 at 23:31
Apologies - it posted before I had a chance to finish typing. – Chris Nov 09 '16 at 23:34
Thank you SO much - #(?i)( – Chris Nov 09 '16 at 23:44
