I know there are many questions on this topic, but most are fairly trivial and I'm unable to find a solution for my case.
I have a set of HTML files containing many "media" items like the one below. Each item is a "paragraph", separated from the next by a blank line ("\n\n"). Here is a link to a sample file of the type I'm working on.
<li class="media">
<div class="media-left">
<a href="#">
<img class="media-object" src="4_17-HE-assoc.png" width="250" alt="...">
</a>
</div>
<div class="media-body">
<h4 class="media-heading">Figure 4.17</h4>
Association plot for the hair-color eye-color data. Left: marginal table, collapsed over
gender; right: full table.
</div>
</li>
For each <img ...> tag, I need to find the src="file" value and replace the href="#" on the previous line with href="file" class="fancybox", i.e., so that the item will then look like
<li class="media">
<div class="media-left">
<a href="4_17-HE-assoc.png" class="fancybox">
<img class="media-object" src="4_17-HE-assoc.png" width="250" alt="...">
</a>
</div>
<div class="media-body">
<h4 class="media-heading">Figure 4.17</h4>
Association plot for the hair-color eye-color data. Left: marginal table, collapsed over
gender; right: full table.
</div>
</li>
I tried the following as a one-liner, but it has no effect, i.e., it doesn't make the changes.
perl -pi~ -e '$/ = "";s|<a href="#">\n(\s*<img class="media object") src=(".*png")|<a class="fancybox" href="\2">\n\1 src=\2|ms' ch03.html
Can someone help with this? I'd be happy with a simple script that I could use for this and adapt for similar changes to a collection of web files.
edit: I'm aware of the advantages of using Perl modules such as HTML::TreeBuilder to avoid having to parse HTML directly. If someone could give me a start, I could probably take it from there.