I'm converting a website to a PDF. The pages contain images, and next to each image there is a piece of text that, when clicked, takes you to the image itself.

I think the following code is responsible for showing that text: when I deleted it from one of the files, the text and link no longer appeared.

<div class="v1"><a target="_self" href="images/graphics/1.jpg">[View full size image]</a></div>

The problem is that about 200 more HTML documents contain this same text, with only the href changing.

Is there an easy way to get rid of all of them without going through the files one by one? Maybe a regular expression for sed?

James Russell

2 Answers

If the expression is always on one line and the only difference is in href, sed is a possible solution:

sed -e 's,<div class="v1"><a target="_self" href="[^"]*">\[View full size image\]</a></div>,,' 

I used an alternative separator, a comma, so the / in the closing tags does not have to be escaped. The square brackets in the link's text still need to be escaped, though.
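To run this over all 200 or so files rather than one at a time, the same substitution can be wrapped in a loop. The following is only a sketch: the function name strip_view_links is made up for illustration, and it assumes the documents end in .html and that the whole div sits on one line, as in the question. It writes to a temporary file and moves it back instead of relying on an in-place flag, since that flag differs between sed implementations.

```shell
# Hypothetical helper: remove the "[View full size image]" div from every
# .html file under the given directory (defaults to the current directory).
# Assumes the files use the .html extension and the div is on a single line.
strip_view_links() {
    dir="${1:-.}"
    find "$dir" -type f -name '*.html' | while IFS= read -r f; do
        # Same substitution as above; write to a temp file, then replace the original.
        sed -e 's,<div class="v1"><a target="_self" href="[^"]*">\[View full size image\]</a></div>,,' "$f" > "$f.tmp" \
            && mv "$f.tmp" "$f"
    done
}
```

Calling strip_view_links path/to/site would process every matching file below that directory. It is probably worth trying it on a copy of the files first, since the originals are overwritten.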

choroba
  • Thank you for the answer, I marked as accepted the other one because it was the one I read and used; but this one is as valid as the other one. – James Russell Oct 23 '12 at 10:17

Yes, regular expressions are likely the easiest solution here. If it's simply a matter of removing this line from all your files, I'd open them in an editor (Sublime Text 2 does this well) and perform a regex search and replace across the files. The following search pattern will likely work:

<div class=\"v1\"><a target=\"_self\" href=\"[^"]+\">\[View full size image\]</a></div>
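Before running a bulk search and replace, it can help to see which files actually contain the link. A minimal sketch from the command line, assuming the documents end in .html (list_view_links is just an illustrative name); it searches for the fixed link text rather than the full pattern, so no regex escaping is involved:

```shell
# Hypothetical helper: print the .html files under a directory that
# contain the literal text "[View full size image]".
list_view_links() {
    dir="${1:-.}"
    # -F treats the text as a fixed string; -l prints only the file names.
    find "$dir" -type f -name '*.html' -exec grep -lF '[View full size image]' {} +
}
```

Running it again after the replacement should print nothing if every occurrence was removed.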

Simon