I want to search for and remove proprietary attributes inside HTML image tags.

I want to remove the following attributes from each IMG tag: data-base-url, data-linked-resource-default-alias, data-linked-resource-container-id, data-image-, data-linked-resource-id, and data-linked-resource-type.

So I'm trying to create a regular expression for the Notepad++ search that finds these attributes and removes them.

Image code examples:

<img data-base-url="http://doc.webdomain.com" data-image-="" data-linked-resource-container-id="5374312" data-linked-resource-default-alias="fo005-categories.png" data-linked-resource-id="11468806" data-linked-resource-type="attachment" src="http://doc.musicbox.com/download/attachments/5374312/fo005-categories.png?version=1&amp;modificationDate=1344416572000" title="Musicbox 1.9 &gt; Browsing the front-office &gt; fo005-categories.png" />


<img data-base-url="http://doc.webdomain.com" data-image-="" data-linked-resource-container-id="5374312" data-linked-resource-default-alias="fo008-suppliers.png" data-linked-resource-id="11468815" data-linked-resource-type="attachment" src="http://doc.musicbox.com/download/attachments/5374312/fo008-suppliers.png?version=1&amp;modificationDate=1344416588000" title="Musicbox 1.9 &gt; Browsing the front-office &gt; fo008-suppliers.png" />

I want to get this image code (with an added alt attribute and a truncated src attribute value):

<img src="http://doc.musicbox.com/download/attachments/5374312/fo008-suppliers.png" title="" alt="" />

How do I write this expression?

Ro Yo Mi
  • http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – Laurent S. Jul 10 '13 at 12:09
  • [These attributes are not proprietary](http://www.w3.org/TR/html51/dom.html#embedding-custom-non-visible-data-with-the-data-*-attributes) – Explosion Pills Jul 10 '13 at 12:10
  • This has been discussed on [meta](http://meta.stackexchange.com/questions/188408/give-me-teh-regez-questions) – user000001 Jul 11 '13 at 14:48

2 Answers

Find:

<img.+src="(.+)" title="(.+)" />

Replace with:

<img src="\1" title="\2" alt="" />
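
For anyone applying the same find/replace outside Notepad++, here is a minimal Python sketch of the idea. It is not part of the original answer: the function name `simplify_img_tags` is hypothetical, and the pattern is a tightened variant that assumes double-quoted values, a `title` attribute immediately after `src`, and no `>` character inside the earlier attribute values. It also cuts the `src` value at the first `?`.

```python
import re

# Variant of the find pattern above (an assumption, not the answer's exact regex):
#   group 1 = src value up to (not including) the first "?"
#   group 2 = title value
IMG_PATTERN = re.compile(
    r'<img\b[^>]*?\ssrc="([^"?]*)[^"]*"\s+title="([^"]*)"\s*/>'
)


def simplify_img_tags(html: str) -> str:
    """Keep only src (cut at the first '?') and title, and add an empty alt."""
    return IMG_PATTERN.sub(r'<img src="\1" title="\2" alt="" />', html)


sample = (
    '<img data-base-url="http://doc.webdomain.com" data-image-="" '
    'src="http://doc.musicbox.com/download/attachments/5374312/'
    'fo005-categories.png?version=1&amp;modificationDate=1344416572000" '
    'title="Musicbox 1.9 &gt; Browsing the front-office &gt; fo005-categories.png" />'
)
print(simplify_img_tags(sample))
```

Note that Python writes backreferences in the replacement as `\1`, `\2`, the same form used in the replace string above.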
DarkBee
  • Yes, this combination works fine. Thank you! One more note: I want to remove the extra data in the image URLs, that is, anything after .png, so that `http://doc.musicbox.com/download/attachments/5374312/fo011-newsletter.png?version=1&modificationDate=1344416606000` becomes `http://doc.musicbox.com/download/attachments/5374312/fo011-newsletter.png` –  Jul 10 '13 at 12:41
  • @sonex If his works, this should work for what you asked in your comment: `` – acdcjunior Jul 10 '13 at 16:33

Description

This regex will:

  • pull the src, alt, width, and title attributes from all image tags
  • skip over potentially problematic attributes
  • allow the attributes to appear in any order
  • for the src attribute, use only the part up to but not including the first ?

Regex:

<img\b(?=\s) # capture the open tag
(?=(?:(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s(src=["][^"]*?)[?"])?)  # find the src attribute and truncate at the first `?`
(?=(?:(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s(alt=["][^"]*["]))?)  # find the alt attribute
(?=(?:(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s(title=["][^"]*["]))?)  # find the title attribute
(?=(?:(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s(width=["][^"]*["]))?)  # find the width attribute
(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?> # get the entire  tag

Replace with: <img $1" $2 $3 $4 />

The " after $1 is required because the src capture stops before the first ? symbol (or before the closing quote), so the closing quote is never part of the captured group.
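
As a rough illustration of how the lookaheads fit together, the expression can be assembled piece by piece in Python. This is only a sketch under assumptions: the names `SKIP`, `IMG_TAG`, and `strip_img_attributes` are hypothetical, `["]` is written as a plain `"`, and Python's `re` module stands in for the Notepad++ regex engine.

```python
import re

# Lazily skips over attributes, consuming each quoted value as a single unit,
# so text inside a quoted value is never mistaken for a real attribute.
SKIP = r'''(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?'''

IMG_TAG = re.compile(
    r'<img\b(?=\s)'
    + r'(?=(?:' + SKIP + r'\s(src="[^"]*?)[?"])?)'   # group 1: src, cut before ? or "
    + r'(?=(?:' + SKIP + r'\s(alt="[^"]*"))?)'       # group 2: alt, if present
    + r'(?=(?:' + SKIP + r'\s(title="[^"]*"))?)'     # group 3: title, if present
    + r'(?=(?:' + SKIP + r'\s(width="[^"]*"))?)'     # group 4: width, if present
    + r'''(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?/?>'''  # consume the whole tag
)


def strip_img_attributes(html: str) -> str:
    """Rebuild each img tag from the captured src, alt, title, and width only."""
    # Groups that did not match are substituted as empty strings (Python 3.5+).
    return IMG_TAG.sub(r'<img \1" \2 \3 \4 />', html)


sample = (
    '''<img onmouseover=' src="BAD.IMAGE.PNG" ; funImageSwap(src) ; ' '''
    'data-base-url="http://doc.webdomain.com" data-linked-resource-type="attachment" '
    'src="http://doc.prestashop.com/download/attachments/5374312/'
    'fo008-suppliers.png?version=1&amp;modificationDate=1344416588000" '
    'title="Musicbox 1.9 &gt; Browsing the front-office &gt; fo008-suppliers.png" />'
)
print(strip_img_attributes(sample))
```

When an attribute such as alt or width is absent, its group is empty and the replacement leaves an extra space in its place, which is harmless in HTML.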

In Notepad++

Sample Text

Note in the second image tag I added a potentially problematic attribute.

<img data-base-url="http://doc.webdomain.com" data-image-="" data-linked-resource-container-id="5374312" data-linked-resource-default-alias="fo005-categories.png" data-linked-resource-id="11468806" data-linked-resource-type="attachment" src="http://doc.prestashop.com/download/attachments/5374312/fo005-categories.png?version=1&amp;modificationDate=1344416572000" title="Musicbox 1.9 &gt; Browsing the front-office &gt; fo005-categories.png" />


<img onmouseover=' src="BAD.IMAGE.PNG" ; funImageSwap(src) ; ' data-base-url="http://doc.webdomain.com" data-image-="" data-linked-resource-container-id="5374312" data-linked-resource-default-alias="fo008-suppliers.png" data-linked-resource-id="11468815" data-linked-resource-type="attachment" src="http://doc.prestashop.com/download/attachments/5374312/fo008-suppliers.png?version=1&amp;modificationDate=1344416588000" title="Musicbox 1.9 &gt; Browsing the front-office &gt; fo008-suppliers.png" />

Find What: <img\b(?=\s)(?=(?:(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s(src=["][^"]*?)[?"])?)(?=(?:(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s(alt=["][^"]*["]))?)(?=(?:(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s(title=["][^"]*["]))?)(?=(?:(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\s(width=["][^"]*["]))?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>

Replace with: <img $1" $2 $3 $4 />

There were problems with Notepad++ regular expressions in previous versions. This works in 6.3.3 and 6.4.2. However, in the later versions the popup dialog box reporting the number of replacements has been changed to a line of text just under the replace window (next to the arrow in the image).

[Screenshot: Notepad++ Replace dialog]

Infinite Recursion
Ro Yo Mi
  • Didn't work for me in Notepad++ 6.4.2 –  Jul 10 '13 at 14:32
  • Did you use the highlighted regex in the Notepad++ section above or the one in the description section? Also, did you configure the find and replace screen as shown in the image? – Ro Yo Mi Jul 10 '13 at 15:24
  • I downloaded and installed 6.4.2 and this worked for me, so I updated the answer to show that this works with the latest version. I also updated the expression to remove the undesirable text in the src value. – Ro Yo Mi Jul 10 '13 at 15:55
  • Sorry, my bad. I didn't select 'Regular expression' in Search Mode. It works correctly. Thanks a lot. –  Jul 10 '13 at 16:59
  • I'm trying to modify regex for `` –  Jul 10 '13 at 17:14
  • to get `` –  Jul 10 '13 at 17:14
  • I updated the expression to also capture the width attribute. – Ro Yo Mi Jul 10 '13 at 18:16