
I have a .sql dump with HTML content in it. I want to remove the title="...." attribute from <img...> tags. The tricky part is that title="....." also appears in <a href.....> tags on the same line.

To make the problem easier to see, I use the following strings in a two-line file:

A B C D B C A B C
Y B C D B C Y B C

The B represents the title="...." part and A....C is the <img....> part.

The resulting file should look like

A C D B C A C
Y B C D B C Y B C

Only the B within A...C should be removed, and the second line should be left untouched.

I'm using sed because I know this best but if somebody knows a better way I'm interested to know.

So far I've used the following command:

sed '/A/ s/B //g' file

The problem is that it also removes the B within D...C:

A C D C A C
Y B C D B C Y B C

Any ideas would be appreciated.

regards,

Arjan

PS: Real life example, just one line:

nbsp;</p><p> <img src="images/vlaggen/dene_vlag.png" border="0" alt="Vlag van Denemarken" title="REMOVE THIS TITLE" width="75" height="50" align="left" />  <a href="images/hov.png" target="_blank" title="DONT REMOVE THIS TITLE"><img src="images/small.png" border="0" alt="Kaart van Denemarken" title="REMOVE THIS TITLE" align="right" /></a>   <br /><br /> </p><p>&nbsp;</p><h1>Title of page</h1>
  • One line solution :- write custom parser (search for pattern and remove subsequent string) – Syed Mohd Mohsin Akhtar Sep 23 '13 at 07:03
    I'm afraid that unless you post an example, you'll receive a response like `sed 's/A B C/A C/g' file` for your example. – devnull Sep 23 '13 at 07:08
  • That's true, I'm aware of that. I added a real example line above, using real data. Be aware that title="...." could be in other places, and could appear any number of times in a line. – Arjan Geertsma Sep 23 '13 at 10:55

2 Answers


I think what you want here is a non-greedy regex, something which sed doesn't support. However, this question provides a potential solution. I've not tested this, but perhaps something along the following lines will help:

perl -pe 's|<img([^>]*?)title="[^"]*"([^>]*)>|<img$1$2>|g'

It's early where I am, but the gist of that is "find the img tags, capture everything that isn't a title attribute, and substitute it at the end".
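For what it's worth, a similar result seems achievable in plain sed without non-greedy matching, by using negated character classes so the match can never run past the end of a tag. The following is only a sketch: the function name `strip_img_titles` and the sample line are made up for illustration, and it assumes that attribute values contain no `>` or `"` characters and that each <img> tag has at most one title attribute.

```shell
# Hypothetical helper: strip title="..." from <img ...> tags only.
# Assumes attribute values contain no '>' or '"' characters and that
# each <img> tag carries at most one title attribute.
strip_img_titles() {
  # [^>]* keeps the match inside a single tag; [^"]* stops at the
  # closing quote of the title value.
  sed 's/\(<img[^>]*\) title="[^"]*"/\1/g'
}

# Shortened sample line modelled on the real-life example (illustrative data).
line='<img src="a.png" alt="x" title="REMOVE" width="75" /> <a href="b.png" title="KEEP"><img src="c.png" title="REMOVE" align="right" /></a>'
printf '%s\n' "$line" | strip_img_titles
```

On the sample line this drops both title attributes inside the <img> tags and leaves the one on the <a> tag alone. To process a whole dump, the function can read from a file and write to a new one, for example `strip_img_titles < dump.sql > cleaned.sql` (both file names are placeholders).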

chooban
  • It's perfect! That does exactly what I want. I will scan through the rest of the file, but it looks like it is working very well. I really need to dive into perl a bit more, it's very powerful. You saved my day! Thank you very much! – Arjan Geertsma Sep 23 '13 at 13:11

I'm not sure whether I got the problem right, but I think you need back-references. Try something like this:

sed 's/\(A\) B \(C\)/\1 \2/g'

result:

A C D B C A C

Y B C D B C Y B C
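If the B is not always directly next to the A, the same back-reference idea can probably be loosened with a negated character class. This is a rough sketch under the assumption that there is at most one B between an A and the next C; the function name `remove_b_inside_ac` and the extra `x` in the input are invented for the example.

```shell
# Hypothetical variant: remove a " B" anywhere between an A and the next C.
# [^C]* cannot run past a C, so the match never leaves the A...C span.
# Assumes at most one B per A...C span.
remove_b_inside_ac() {
  sed 's/\(A[^C]*\) B/\1/g'
}

# Two-line sample; the last A...C span on line 1 has an extra token before B.
printf 'A B C D B C A x B C\nY B C D B C Y B C\n' | remove_b_inside_ac
```

Here the first line becomes `A C D B C A x C`, while the second line, which contains no A, passes through unchanged.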

Raphael Roth