PHP regex to strip out images with a CID in the src tag

Question

I am currently using PHP to parse emails. I am able to save both attached and embedded images; however, embedded images are given an irritating "CID" source that results in a broken image link.

I want to parse these out completely, but leave images that have actual web addresses associated.

In other words, <img src = "http://example.com/images/someimage.jpg"> needs to stay. But, <img src = "cid:ii_id8bx9qh0_14f205b0a5e7738a"> needs to go.

Now, I could use strpos to find the start and end, and that would be okay... except that certain email clients also embed things like width, height, and ID - and they put them in haphazard order.

So, I need a regex that looks for a start of <img, that contains src="cid, all the way to the end of the image tag.

Bonus points if it's case insensitive.

Thanks for your help!

["Regex is not a tool that can be used to correctly parse HTML."](http://stackoverflow.com/a/1732454/1344955) — SeinopSys, Aug 12 '15 at 06:08
A large portion of the internet seems to disagree. Regex is for pattern matching, no? I'm trying to match a pattern, nothing more. — osuddeth, Aug 12 '15 at 06:18

hwnd · Answer 1 · 2015-08-12T06:04:57.660

1

Use a proper tool for this task instead of regex.

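$doc = new DOMDocument;
libxml_use_internal_errors(true); // email HTML is often malformed; keep parser warnings quiet
$doc->loadHTML($html); // load the HTML data

$xp = new DOMXPath($doc);

// starts-with() matches only the "cid:" scheme, so URLs that merely contain "cid" are kept
foreach ($xp->query('//img[starts-with(@src, "cid:")]') as $img) {
    $img->parentNode->removeChild($img);
}

echo $doc->saveHTML(); // saveHTML() returns the updated HTML as a string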
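
To get the cleaned markup back as a string rather than echoing it, the same approach could be wrapped in a function. This is only a sketch: the function name stripCidImages is hypothetical and not part of the answer, and the translate() call is an added assumption to make the scheme check case-insensitive (XPath 1.0 has no lower-case function).

```php
<?php
// Hypothetical helper (not from the original answer): returns the HTML with cid: images removed.
function stripCidImages(string $html): string
{
    $doc = new DOMDocument();

    // Email HTML is frequently malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);

    // translate() folds C, I and D to lower case so "CID:" and "Cid:" also match.
    $query = '//img[starts-with(translate(normalize-space(@src), "CID", "cid"), "cid:")]';

    // Copy the matches into an array first so removing nodes cannot disturb the iteration.
    foreach (iterator_to_array($xp->query($query)) as $img) {
        $img->parentNode->removeChild($img);
    }

    // saveHTML() returns the serialized document as a string.
    return $doc->saveHTML();
}

$html = '<p>Hi <img width="10" src="cid:ii_id8bx9qh0_14f205b0a5e7738a">'
      . '<img src="http://example.com/images/someimage.jpg"></p>';

$clean = stripCidImages($html);
```

Note that loadHTML() wraps a fragment in a doctype and html/body elements, so $clean is a full document rather than the bare fragment that went in.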

edited Aug 12 '15 at 06:04

answered Aug 12 '15 at 05:57


So, stupid question time. How do I then access the updated HTML? – osuddeth Aug 12 '15 at 06:04
I get no errors... but I still get the stupid cid images. Hrm. – osuddeth Aug 12 '15 at 06:13
Works for me no problem. – hwnd Aug 12 '15 at 06:15

Avinash Raj · Accepted Answer · 2015-08-12T06:19:22.537

-2

Use preg_replace

preg_replace('~<img\b[^>]*src\s*=\s*"cid[^"]*"[^>]*>~i', '', $str);

or

preg_replace('~<img\b[^>]*\bsrc\s*=\s*[\'"]cid[^>]*>~i', '', $str);

The i modifier makes the match case-insensitive.
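
As a quick check, the second pattern can be run against the two tags from the question plus a couple of variations. The sample string below is made up for illustration (mixed case, extra attributes in different positions, single quotes); only the pattern itself comes from the answer.

```php
<?php
// Second pattern from the answer, unchanged.
$pattern = '~<img\b[^>]*\bsrc\s*=\s*[\'"]cid[^>]*>~i';

// Illustrative input: one normal web image and two embedded (cid) images.
$str = '<img src = "http://example.com/images/someimage.jpg">'
     . '<IMG width="20" SRC = "cid:ii_id8bx9qh0_14f205b0a5e7738a" height="20">'
     . "<img id='x' src='cid:abc'>";

// Both cid images are removed; the web image is left alone.
$result = preg_replace($pattern, '', $str);

echo $result, PHP_EOL;
```

One limitation to keep in mind: because the pattern relies on [^>]*, an attribute value that itself contains a > character would cut the match short.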


edited Aug 12 '15 at 06:19

answered Aug 12 '15 at 05:39


Excellent. That did it! – osuddeth Aug 12 '15 at 05:48
Actually, I spoke too soon. Testing further, doesn't always work. – osuddeth Aug 12 '15 at 06:02
Forgot to escape the quotes; try now. – Avinash Raj Aug 12 '15 at 06:19
Meant to come back and say that yes, the fix made it work. Thanks for your help! – osuddeth Aug 27 '15 at 05:25
