0

I know there are many questions on this topic, but most are fairly trivial and I'm unable to find a solution for my case.

I have a set of HTML files with many, many "media" items like the following, each of which is a "paragraph", separated by "\n\n". Here is a link to a sample file of the type I'm working on.

  <li class="media">
    <div class="media-left">
      <a href="#">
        <img class="media-object" src="4_17-HE-assoc.png" width="250" alt="...">
      </a>
    </div>
    <div class="media-body">
      <h4 class="media-heading">Figure 4.17</h4>
      Association plot for the hair-color eye-color data. Left: marginal table, collapsed over
      gender; right: full table.
    </div>
  </li>

For each <img ...> tag, I need to find the src="file" value and replace the href="#" on the previous line with href="file" class="fancybox", so that the item will then look like this:

  <li class="media">
    <div class="media-left">
      <a href="4_17-HE-assoc.png" class="fancybox">
        <img class="media-object" src="4_17-HE-assoc.png" width="250" alt="...">
      </a>
    </div>
    <div class="media-body">
      <h4 class="media-heading">Figure 4.17</h4>
      Association plot for the hair-color eye-color data. Left: marginal table, collapsed over
      gender; right: full table.
    </div>
  </li>

I tried the following as a one-liner, but it has no effect, i.e., it doesn't make the changes.

perl -pi~ -e '$/ = "";s|<a href="#">\n(\s*<img class="media object") src=(".*png")|<a class="fancybox" href="\2">\n\1 src=\2|ms' ch03.html

Can someone help with this? I'd be happy with a simple script that I could use for this and modify for other similar modifications of a collection of web files.

edit: I'm aware of the advantages of using perl modules such as HTML::TreeBuilder to avoid having to parse HTML directly. If someone could give me a start, I could probably take it from there.

user101089
  • 3
    Before any other user does it: [obligatory warning to use regular expressions for html tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Jan Jan 20 '16 at 17:56
  • 2
    Stop. Take a step back. Forget that you wanted to use regular expressions to parse HTML. Now [do it properly](http://stackoverflow.com/search?q=%5Bperl%5D+parse+html). – Matt Jacob Jan 20 '16 at 18:30

2 Answers

7
use strict;
use warnings;

use XML::LibXML qw( );

my $qfn = 'ch03.html';

# Keep the original as a backup (ch03.html~) and write the result to ch03.html.
my $in_qfn = $qfn . "~";
my $out_qfn = $qfn;
rename($qfn, $in_qfn)
   or die("Can't rename \"$qfn\": $!\n");

my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_file($in_qfn);

# Visit every placeholder link.
for my $a_node ($doc->findnodes('//a[@href="#"]')) {
   # Skip links that don't contain an image.
   my ($src_node) = $a_node->findnodes('img[1]/@src')
      or next;

   $a_node->setAttribute('href', $src_node->value());
   $a_node->setAttribute('class', 'fancybox');
}

my $html = $doc->toStringHTML();
open(my $fh, '>', $out_qfn)
   or die("Can't create \"$out_qfn\": $!\n");

print($fh $html);
close($fh)
   or die("Can't write \"$out_qfn\": $!\n");

Tested:

$ diff -u ch03.html{~,}
--- ch03.html~  2016-01-20 12:41:30.809203040 -0800
+++ ch03.html   2016-01-20 12:41:31.009201042 -0800
@@ -1,7 +1,7 @@
-<div class="contents">
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
+<html><body><div class="contents">
 <h1 class="tocpage">Chapter 3: Fitting and Graphing Discrete Distributions</h1>
 <hr class="tocpage">
-
 <div class="row">
   <div class="col-md-6">
     <!-- prelude-inserted  -->
@@ -18,7 +18,7 @@
   <div class="col-md-6">
     <h3>Contents</h3>
     <dl class="chaptoc">
-        <dd>3.1. Introduction to discrete distributions</dd>
+<dd>3.1. Introduction to discrete distributions</dd>
         <dd>3.2. Characteristics of discrete distributions</dd>
         <dd>3.3. Fitting discrete distributions</dd>
         <dd>3.4. Diagnosing discrete distributions: Ord plots</dd>
@@ -27,8 +27,7 @@
         <dd>3.7. Chapter summary</dd>
         <dd>3.8. Lab exercises</dd>
     </dl>
-
-  </div>
+</div>
 </div>

 <!-- more-content -->
@@ -38,11 +37,10 @@
        <h3>Selected figures</h3>
      <a class="btn btn-primary" href="../../Rcode/ch03.R" role="button">view R code</a>
     <ul class="media-list">
-      <li class="media">
+<li class="media">
         <div class="media-left">
-          <a href="#">
-            <img class="media-object" src="saxony-barplot.png" width="250" alt="males in Saxony families">
-          </a>
+          <a href="saxony-barplot.png" class="fancybox">
+            <img class="media-object" src="saxony-barplot.png" width="250" alt="males in Saxony families"></a>
         </div>
         <div class="media-body">
           <h4 class="media-heading">Figure 3.2</h4>
@@ -52,9 +50,8 @@

       <li class="media">
         <div class="media-left">
-          <a href="#">
-            <img class="media-object" src="dbinom2-plot2-1.png" width="250" alt="Binomial distributions">
-          </a>
+          <a href="dbinom2-plot2-1.png" class="fancybox">
+            <img class="media-object" src="dbinom2-plot2-1.png" width="250" alt="Binomial distributions"></a>
         </div>
         <div class="media-body">
           <h4 class="media-heading">Figure 3.9</h4>
@@ -64,9 +61,8 @@

       <li class="media">
         <div class="media-left">
-          <a href="#">
-            <img class="media-object" src="dpois-xyplot2-1.png" width="250" alt="Poisson distributions">
-          </a>
+          <a href="dpois-xyplot2-1.png" class="fancybox">
+            <img class="media-object" src="dpois-xyplot2-1.png" width="250" alt="Poisson distributions"></a>
         </div>
         <div class="media-body">
           <h4 class="media-heading">Figure 3.11</h4>
@@ -76,9 +72,8 @@

       <li class="media">
         <div class="media-left">
-          <a href="#">
-            <img class="media-object" src="Fed0-plots2-1.png" width="250" alt="Hanging rootogram">
-          </a>
+          <a href="Fed0-plots2-1.png" class="fancybox">
+            <img class="media-object" src="Fed0-plots2-1.png" width="250" alt="Hanging rootogram"></a>
         </div>
         <div class="media-body">
           <h4 class="media-heading">Figure 3.15</h4>
@@ -89,9 +84,8 @@

       <li class="media">
         <div class="media-left">
-          <a href="#">
-            <img class="media-object" src="ordplot1-1.png" width="250" alt="Ord plot for the Butterfly data">
-          </a>
+          <a href="ordplot1-1.png" class="fancybox">
+            <img class="media-object" src="ordplot1-1.png" width="250" alt="Ord plot for the Butterfly data"></a>
         </div>
         <div class="media-body">
           <h4 class="media-heading">Figure 3.18</h4>
@@ -100,9 +94,10 @@
         </div>
       </li>

-    </ul> <!-- media-list -->
-  </div> <!-- col-md-12 -->
+    </ul>
+<!-- media-list -->
+</div> <!-- col-md-12 -->
 <!-- footer -->
 </div>  <!-- row -->

-</div>
+</div></body></html>
ikegami
  • I can see what this is trying to do, but it doesn't work for my test file and I can't see why. In particular, the `href` and `class` attributes are not replaced in my `<a>` tags – user101089 Jan 20 '16 at 20:18
  • 1
    This code was written to solve the problem you asked about, and it was tested against both the file in the question and the file to which the question links to make sure it works. If you use a file in a different format than the one provided in the question, you'll need to adjust the code. If you adjust the file in your question to be more like your real file, and you let me know that you've done this by leaving a comment here, I'll update my answer. – ikegami Jan 20 '16 at 20:26
  • I greatly appreciate your help with this, and haven't changed the file format at all. My test file, `ch03.html`, was added to the original question, and is given here: https://raw.githubusercontent.com/friendly/DDAR/master/pages/chapters/ch03.html – user101089 Jan 20 '16 at 20:35
  • 1
    Again, it works for that file. My answer now includes the result. – ikegami Jan 20 '16 at 20:43
  • OK, I accept that your answer works for you. My perl (v5.14.2) has many modules out of date, so I'll have to upgrade them first to test. – user101089 Jan 20 '16 at 22:15
1

I couldn't resist writing this one-off, super unstable, sends-me-to-parse-html-with-regex-hell sed command:

sed -i.bak '/<a href="#"/ {
    N
    /\n.*<img class=/ {
        s/^\( *<a href="\).*\(\n.*src="\)\([^"]*\)\(.*\)/\1\3" class="fancybox">\2\3\4/
    }
}' ch03.html

This looks for a line containing <a href="#", appends the next line to the pattern space with N and, only if that appended line contains <img class=, substitutes the src file name and class="fancybox" into the a tag.
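To try the command without touching a real page, here is a minimal sketch that runs it on a throwaway file. The file name demo.html, the temporary directory and the extra "back to top" link are assumptions made up for illustration; they are not part of the question's ch03.html.

```shell
# Throwaway demo: demo.html and its contents are hypothetical test data.
tmpdir=$(mktemp -d)
f="$tmpdir/demo.html"

# One placeholder link wrapping an image, and one that wraps plain text.
cat > "$f" <<'EOF'
      <a href="#">
        <img class="media-object" src="4_17-HE-assoc.png" width="250" alt="...">
      </a>
      <a href="#">back to top</a>
      <p>some trailing text</p>
EOF

# Same command as above; -i.bak keeps the original as demo.html.bak.
sed -i.bak '/<a href="#"/ {
    N
    /\n.*<img class=/ {
        s/^\( *<a href="\).*\(\n.*src="\)\([^"]*\)\(.*\)/\1\3" class="fancybox">\2\3\4/
    }
}' "$f"

# Show what changed relative to the backup (diff exits 1 when files differ).
diff "$f.bak" "$f" || true
```

The first link picks up the image's file name and the fancybox class, while the "back to top" link is left alone because the line after it has no <img class=. The trailing <p> line is there on purpose: when N runs on the last line of a file, sed implementations differ in whether the pattern space is still printed.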

Diffing the result and the input file:

43c43
<           <a href="#">
---
>           <a href="saxony-barplot.png" class="fancybox">
55c55
<           <a href="#">
---
>           <a href="dbinom2-plot2-1.png" class="fancybox">
67c67
<           <a href="#">
---
>           <a href="dpois-xyplot2-1.png" class="fancybox">
79c79
<           <a href="#">
---
>           <a href="Fed0-plots2-1.png" class="fancybox">
Benjamin W.
  • 1
    I am always ambivalent when someone does this - because whilst it's technically correct, and thus doesn't really deserve a downvote, neither do I think I could in all conscience recommend someone actually use it. – Sobrique Jan 21 '16 at 12:11
  • However, it does **exactly** what I asked for, and doesn't mess up the formatting of the rest of the file as does the `XML::LibXML` solution. – user101089 Jan 21 '16 at 13:17