I am currently using PHP to parse emails. I am able to save both attached and embedded images; however, embedded images are given an irritating "CID" source that results in a broken image link.

I want to parse these out completely, but leave images that have actual web addresses associated.

In other words, <img src = "http://example.com/images/someimage.jpg"> needs to stay. But, <img src = "cid:ii_id8bx9qh0_14f205b0a5e7738a"> needs to go.

Now, I could use strpos to find the start and end, and that would be okay... except that certain email clients also embed attributes like width, height, and ID, and they put them in haphazard order.

So, I need a regex that matches from the opening <img, through an attribute of src="cid, all the way to the end of the image tag.

Bonus points if it's case insensitive.

Thanks for your help!

osuddeth
  • ["Regex is not a tool that can be used to correctly parse HTML."](http://stackoverflow.com/a/1732454/1344955) – SeinopSys Aug 12 '15 at 06:08
  • A large portion of the internet seems to disagree. Regex is for pattern matching, no? I'm trying to match a pattern, nothing more. – osuddeth Aug 12 '15 at 06:18

2 Answers

Use a proper tool for this task, such as an HTML parser (DOMDocument with DOMXPath), instead of regex.

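$doc = new DOMDocument;
libxml_use_internal_errors(true); // email HTML is often malformed; keep parser warnings quiet
$doc->loadHTML($html); // load the HTML data

$xp = new DOMXPath($doc);

// Match only src values that start with the cid: scheme, so a URL that
// merely contains "cid" (e.g. .../acid.jpg) is left alone.
foreach ($xp->query('//img[starts-with(@src, "cid:")]') as $img) {
   $img->parentNode->removeChild($img); // drop the embedded-image tag
}

echo $doc->saveHTML();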
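
If the match also needs to be case insensitive, as the question asks, one possible sketch is to wrap the DOM approach in a function and lower-case the prefix inside the XPath expression. The function name stripCidImages is hypothetical, and the translate() trick is just one way to work around XPath 1.0 having no lower-case function. It assumes $html is the HTML body of the email.

```php
<?php
// Hypothetical helper: removes <img> elements whose src uses the cid: scheme.
function stripCidImages(string $html): string
{
    $doc = new DOMDocument();

    // Email HTML is often malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);

    // translate() maps C, I, D to lower case so "CID:" and "cid:" both match;
    // normalize-space() ignores stray whitespace around the attribute value.
    $query = '//img[starts-with(translate(normalize-space(@src), "CID", "cid"), "cid:")]';

    foreach ($xp->query($query) as $img) {
        $img->parentNode->removeChild($img);
    }

    return $doc->saveHTML();
}

// The two tags from the question, with mixed case and extra attributes.
$html = '<p><img src="http://example.com/images/someimage.jpg">'
      . '<img width="10" SRC="CID:ii_id8bx9qh0_14f205b0a5e7738a" height="10"></p>';

echo stripCidImages($html);
```

Note that loadHTML() without extra flags wraps a fragment in a doctype, html, and body element, so the returned string is a full document rather than the bare fragment.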
hwnd

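Use preg_replace and assign the result back, since it returns the modified string:

$str = preg_replace('~<img\b[^>]*\bsrc\s*=\s*"cid:[^"]*"[^>]*>~i', '', $str);

or, to also accept single-quoted attribute values:

$str = preg_replace('~<img\b[^>]*\bsrc\s*=\s*[\'"]cid:[^>]*>~i', '', $str);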

The i modifier makes the match case-insensitive.
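
As a quick check, here is a small sketch that runs a pattern of this shape against the two kinds of tags from the question. The sample string is made up for illustration, and the colon after "cid" is an added assumption so that ordinary URLs containing those letters are not removed.

```php
<?php
// Sample input: one remote image and one cid: image with attributes
// in mixed order and mixed case.
$str = '<IMG SRC = "http://example.com/images/someimage.jpg">'
     . '<img width="20" id="x" src = \'CID:ii_id8bx9qh0_14f205b0a5e7738a\' height="20">';

// <img, any attributes, then src = followed by a quote and the cid: scheme,
// then everything up to the closing > of the same tag.
$pattern = '~<img\b[^>]*\bsrc\s*=\s*[\'"]cid:[^>]*>~i';

$clean = preg_replace($pattern, '', $str);

echo $clean, PHP_EOL;
// prints: <IMG SRC = "http://example.com/images/someimage.jpg">
```

Because [^>]* stops at the first >, a pattern like this will likely misbehave if an attribute value itself contains a > character, which is one reason the parser-based answer is more robust.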

Avinash Raj