Editing multiple HTML files using SED (or something similar)

Question

I have about 1000 HTML files to edit which represent footnotes in a large technical document. I have been asked to go through the HTML files one by one and manually edit the HTML, to get it all on the straight and narrow.

I know that this could probably be done in a matter of seconds with SED as the changes to each file are similar. The body text in each file can be different but I want to change the tags to match the following:

<body>
<p class="Notes">See <i>R v Swain</i> (1992) 8 CRNZ 657 (HC).</p>
</body>

The text may change; for example, it could say 'See R v Pinky and the Brain (1992)' or something like that, but the tags around it should match the above.

Currently, however, the body text may be:

<body>
<p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span 
  class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Pinky and the Brain</i> (1992) </span></span></span></span></span></p>
</body>

or even:

<body>
<p class="FootnoteText"><span class="FootnoteReference"><span lang="EN-US" 
  xml:lang="EN-US" style="font-size: 10.0pt;"><span><![endif]></span></span></span>See <i>R v Pinky and the Brain</i> (1992)</p>
</body>

Can anybody suggest a SED expression or something similar that would solve this?

It isn't clear what you are looking to do. Is the HTML markup okay and just the literal text needs to change? Please make it more clear what you want and what you have. Two examples of: I have **A** but want **A'** and **B** needs to be **B'** would be good, three examples would be better. — msw, Jul 16 '10 at 00:20
It's not very clear, but it looks like all you want to do is remove all the spans. Is that correct? — Dennis Williamson, Jul 16 '10 at 03:48

eruciform · Answer 1 · 2010-07-16T02:46:54.487

Like this?:

perl -pe 's/Swain/Pinky and the Brain/g;' -i lots.html of.html files.html

The breakdown:

-e = "Use code on the command line"
-p = "Execute the code on every line of every file, and print out the line, including what changed"
-i = "Actually replace the files with the new content"

If you swap out -i with -i.old then lots.html.old and of.html.old (etc) will contain the files before the changes, in case you need to go back.

This will replace just Swain with Pinky and the Brain in all the files. Further changes would require more runs of the command. Or:

s/Swain/Pinky/g; s/Twain/Brain/g;

To swap Swain with Pinky and Twain with Brain everywhere.

Update:

If you can be sure about the incoming formatting of the data, then something like this may suffice:

# cat ff.html
  <body>
  <p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span 
    class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Twain</i> (1992) </span></span></span></span></span></p>
  <p class="Notes"><span class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB"><span><span 
    class="FootnoteReference"><span lang="EN-GB" xml:lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Swain</i> (1992) </span></span></span></span></span></p>
  </body>

# perl -pe 'BEGIN{undef $/;} s/<[pP][ >].*?See <i>(.*?)<\/i>(.*?)<.*?\/[pP]>/<p class="Notes">See <i>$1<\/i>$2<\/p>/gsm;' ff.html
  <body>
    <p class="Notes">See <i>R v Twain</i> (1992) </p>
    <p class="Notes">See <i>R v Swain</i> (1992) </p>
  </body>

Explanations:

BEGIN{undef $/;} = treat the whole document as one string, or else html that has newlines in it won't get handled properly
<[pP][ >] = the beginning of a p-tag (upper or lower case)
.*? = lots of stuff, non-greedy-matched (see http://en.wikipedia.org/wiki/Regular_expression#Lazy_quantification)
See  = literally look for that string - very important, since that seems to be the only common denominator
(.*?) = put more stuff into a parentheses group (to be used later)
<\/i> = the end i-tag
(.*?) = put more stuff into a parentheses group (to be used later)
<.*?\/[pP]> = the end p-tag and other possible tags mashed up before it (like all your spans)
and replace it with the string you want, where $1 and $2 are what got snagged in the parentheses before, i.e. the two (.*?) 's
g = global search - so possibly more than one per line
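s = let . match newlines as well, so a match can span several lines (the whole file is one string now due to the BEGIN at the top)
m = only changes how ^ and $ behave; the pattern uses neither, so it is harmless here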
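
For anyone who wants to try the pattern without touching any files, here is a rough Python equivalent. It is only a sketch, not part of the answer above: it assumes every footnote contains "See <i>...</i>" inside a single p element, and the name normalise_footnotes is made up for illustration.

```python
import re

# Same pattern as the Perl one-liner. re.DOTALL plays the role of the /s flag:
# it lets "." match the newlines that sit inside the nested span tags.
FOOTNOTE = re.compile(
    r'<[pP][ >].*?See <i>(.*?)</i>(.*?)<.*?/[pP]>',
    re.DOTALL,
)

def normalise_footnotes(html):
    """Collapse each footnote paragraph to <p class="Notes">See <i>..</i>..</p>."""
    return FOOTNOTE.sub(r'<p class="Notes">See <i>\1</i>\2</p>', html)

# A cut-down version of the markup from the question, with a line break inside a tag.
sample = (
    '<body>\n'
    '<p class="Notes"><span class="FootnoteReference"><span\n'
    '  lang="EN-GB" style="font-size: 10.0pt;">See <i>R v Twain</i> (1992) </span></span></p>\n'
    '</body>'
)

print(normalise_footnotes(sample))
```

As with the Perl version, any whitespace between the citation and the first closing tag is kept, so a trailing space can remain before the closing p-tag.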

I've specified my question badly. The body text is always different but the tags need to remain the same — John, Jul 16 '10 at 00:19
updated. i would definitely try it out on one file first, then do it all with -i.old just in case. if your input files are really junky, it may not do everything you want. — eruciform, Jul 16 '10 at 00:36
uh yeah, people with the -1's, he wanted a quickie solution for a few strings. this is not a perfect solution. — eruciform, Jul 16 '10 at 16:51
You should be very wary when using regex for parsing html/xml. Read this: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html And this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — worldsayshi, Dec 20 '13 at 13:07
@worldsayshi - yes, that's definitely true. but the op didn't seem to want a fancy answer, just a simple regex with an explanation. otherwise, yes, anything more than the very simple *ml parsing is not a good idea at all. — eruciform, Jan 18 '14 at 04:32

score 0 · Answer 2 · answered Jul 28 '10 at 16:19

First convert your HTML files to proper XHTML using http://tidy.sourceforge.net and then use xmlstarlet to do the necessary XHTML processing.

Note: Get the current version of xmlstarlet for in-place XML file editing.

Here's a simple, yet complete mini-example:

curl -s http://checkip.dyndns.org > dyndns.html

tidy -wrap 0 -numeric -asxml -utf8 2>/dev/null < dyndns.html > dyndns.xml

# test: print body text to stdout (dyndns.xml)
xml sel -T \
   -N XMLNS="http://www.w3.org/1999/xhtml" \
   -t -m "//XMLNS:body" -v '.' -n \
   dyndns.xml

# edit body text in-place (dyndns.xml)
xml ed -L \
   -N XMLNS="http://www.w3.org/1999/xhtml" \
   -u "//XMLNS:body" -v '<p>NEW BODY TEXT</p>' \
   dyndns.xml

# create new HTML file (by overwriting the original one!)
xml unesc < dyndns.xml > dyndns.html

score 0 · Answer 3 · answered Jul 28 '10 at 18:26

To consolidate the span tags you may use tidy (version released on 25 March 2009) as well!

# get current tidy version: http://tidy.cvs.sourceforge.net/viewvc/tidy/tidy/
# see also: http://tidy.sourceforge.net/docs/quickref.html#merge-spans

tidy -q -c --merge-spans yes file.html

Philippe A. · Answer 4 · 2010-07-16T13:20:56.847

You will have to check your input files to verify which assumptions can safely be made. Based on your two examples, I have made the following assumptions. Test them against a sample of your input files, and look for any further assumptions I may have missed.

The file consists of a single footnote contained in a single <body></body> pair. The body tags are always present and well formed.
The footnote is buried somewhere inside a <p></p> pair and one or more <span> tags. <!...> tags can be discarded.
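
One way to check these assumptions across all the files before running anything destructive is to count the relevant tags. This is a rough Python sketch and not part of the script below; the names check_assumptions and report, and the wording of the messages, are made up for illustration.

```python
import re
from pathlib import Path

def check_assumptions(html):
    """Return a list of problems; an empty list means both assumptions hold."""
    problems = []
    # Assumption 1: exactly one plain <body> ... </body> pair.
    if len(re.findall(r'<body>', html)) != 1 or len(re.findall(r'</body>', html)) != 1:
        problems.append('expected exactly one <body></body> pair')
    # Assumption 2: a single footnote, i.e. exactly one opening <p> tag.
    if len(re.findall(r'<p[\s>]', html)) != 1:
        problems.append('expected exactly one <p> element')
    return problems

def report(paths):
    """Print one line per violated assumption for each file in paths."""
    for path in paths:
        # errors='replace' keeps the check going if a file is not valid UTF-8.
        text = Path(path).read_text(encoding='utf-8', errors='replace')
        for problem in check_assumptions(text):
            print(f'{path}: {problem}')
```

Calling report(Path('.').glob('*.html')) would list only the files that need a closer look; files that print nothing satisfy both checks.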

The following Perl script works for both examples you have supplied (on Linux with Perl 5.10.0). Before using it, make sure you have a backup of your original html files. By default, it will only print the result on stdout without changing any file.

#!/usr/bin/perl

$overwrite = 0;

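# undefine the input record separator so that <FH> slurps the whole file;
# note that $/ = '' would select paragraph mode and stop at the first blank line
undef $/;
foreach $filename (@ARGV)
{
  # slurp entire file in $full_text variable
  open(FH, '<', $filename) or do { warn "Cannot read $filename: $!\n"; next; };
  $full_text = <FH>;
  close FH;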

  # match everything that is found before the body tag, everything
  # between the body tags, and what follows
  # s modifier lets . match newlines, so the pattern spans the whole file
  ($before_body, $body, $after_body) = ($full_text =~ m!(.*)<body>(.*)</body>(.*)!s);
  #print $before_body, $body, $after_body;

  # leave files without a <body></body> pair untouched instead of emptying them
  unless (defined $body)
  {
    warn "No <body></body> pair found in $filename, skipped\n";
    next;
  }

  if ($overwrite)
  {
      ! -f "$filename.bak" && rename $filename, "$filename.bak";
  }

  # Discard unwanted tags from the body
  $body =~ s%<span.*?>%%sg;
  $body =~ s%</span.*?>%%sg;
  $body =~ s%<p.*?>%%sg;
  $body =~ s%</p.*?>%%sg;
  $body =~ s%<!.*?>%%sg;
  # Remaining leading and trailing whitespace likely to be newlines: remove
  $body =~ s%^\s*%%sg;
  $body =~ s%\s*$%%sg;

  if ($overwrite)
  {
    open FH, ">$filename";
    print FH $before_body, "<body>\n<p class=\"Notes\">$body</p>\n</body>", $after_body;
    close FH;
  }
  else
  {
        print $before_body, "<body>\n<p class=\"Notes\">$body</p>\n</body>", $after_body;
  }
}

To use it:

./script.pl file1.html 
./script.pl file1.html file2.html
./script.pl *.html

Tweak it and when you're happy set $overwrite=1. The script creates a .bak only if it does not already exist.

score -2 · Answer 5 · answered Jul 16 '10 at 00:38

If you have 1 entry per file, no rigid structure in these files, and possibly multiple lines, I would go for a PHP or Perl script to process them file by file, while emitting suitable warnings when the patterns don't match.

Use

php -f thescript.php

to execute thescript.php, which contains

<?php
$path = "datapath/";
$dir = opendir($path);
while ( ( $fn = readdir($dir) ) !== false )
{
    if ( preg_match("/html$/",$fn) ) process($path.$fn);
}

function process($file)
{
    $in = file_get_contents($file);
    $in2 = str_replace("\n"," ",strip_tags($in,"<i>"));
    if ( preg_match("#^(.*)<i>(.*)</i>(.*)$#i",$in2,$match) )
    {
         list($dummy,$p0,$p1,$p2) = $match;
         $out = "<body>$p0<i>$p1</i>$p2</body>";
         file_put_contents($file.".out",$out);
    } else {
         print "Problem with $file? (stripped down to: $in2)\n";
         file_put_contents($file.".problematic",$in);
    }
}
?>

You could tweak this to your needs until the number of misses is low enough to do the last few by hand. You will probably need to add some $p0 = trim($p0); calls and the like to sanitize everything.
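
For comparison, the same keep-only-the-i-tags idea can be sketched with Python's standard html.parser module instead of strip_tags. This is a swapped-in technique, not a translation of the PHP above: the names KeepItalics and clean_footnote are made up for illustration, and it assumes one footnote per file, all of it inside the body.

```python
from html.parser import HTMLParser

class KeepItalics(HTMLParser):
    """Collect body text and <i>...</i> tags; drop every other tag and declaration."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; exactly as written.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == 'body':
            self.in_body = True
        elif tag == 'i' and self.in_body:
            self.parts.append('<i>')

    def handle_endtag(self, tag):
        if tag == 'body':
            self.in_body = False
        elif tag == 'i' and self.in_body:
            self.parts.append('</i>')

    def handle_data(self, data):
        if self.in_body:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.parts.append(f'&{name};')

    def handle_charref(self, name):
        if self.in_body:
            self.parts.append(f'&#{name};')

def clean_footnote(html):
    """Return the normalised body for a file holding a single footnote."""
    parser = KeepItalics()
    parser.feed(html)
    parser.close()
    # Collapse newlines and runs of spaces, and trim both ends.
    text = ' '.join(''.join(parser.parts).split())
    return f'<body>\n<p class="Notes">{text}</p>\n</body>'
```

Because the parser tracks tags itself, line breaks inside a tag and stray <![endif]> markers need no special handling. Only the body is returned, so anything before or after it in the file would have to be stitched back on separately.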

i think the person that takes offense to parsing html may have thought that theirs is better :-P — eruciform, Jul 16 '10 at 20:39
I hate it when people mix up programming, pragmatism and religion. — mvds, Jul 16 '10 at 21:31
