
Okay, regex ninjas. I'm trying to devise a pattern to add hyperlinks to endnotes in an ePub ebook XHTML file. The problem is that numbering restarts within each chapter, so I need to add a unique identifier to the anchor name in order to hash link to it.

Given a (much simplified) list like this:

<h2>Introduction</h2>
<p> 1 Endnote entry number one.</p>
<p> 2 Endnote entry number two.</p>
<p> 3 Endnote entry number three.</p>
<p> 4 Endnote entry number four.</p>

<h2>Chapter 1: The Beginning</h2>
<p> 1 Endnote entry number one.</p>
<p> 2 Endnote entry number two.</p>
<p> 3 Endnote entry number three.</p>
<p> 4 Endnote entry number four.</p>

I need to turn it into something like this:

<h2>Introduction</h2>
<a name="endnote-introduction-1"></a><p> 1 Endnote entry number one.</p>
<a name="endnote-introduction-2"></a><p> 2 Endnote entry number two.</p>
<a name="endnote-introduction-3"></a><p> 3 Endnote entry number three.</p>
<a name="endnote-introduction-4"></a><p> 4 Endnote entry number four.</p>

<h2>Chapter 1: The Beginning</h2>
<a name="endnote-chapter-1-the-beginning-1"></a><p> 1 Endnote entry number one.</p>
<a name="endnote-chapter-1-the-beginning-2"></a><p> 2 Endnote entry number two.</p>
<a name="endnote-chapter-1-the-beginning-3"></a><p> 3 Endnote entry number three.</p>
<a name="endnote-chapter-1-the-beginning-4"></a><p> 4 Endnote entry number four.</p>

Obviously there will need to be a similar search in the actual text of the book, where each endnote will be linked to endnotes.xhtml#endnote-introduction-1 etc.

The biggest obstacle is that each match search begins AFTER the previous search ends, so unless you use recursion, you can't match the same bit (in this case, the title) for more than one entry. My attempts with recursion have so far yielded only endless loops, however.

I'm using TextWrangler's grep engine, but if you have a solution in a different editor (such as vim), that's fine too.

Thanks!

Yorb
    Parsing HTML with a regular expression? Take a look at this http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Penang Aug 03 '11 at 22:26

2 Answers


A bit of awk might do the trick:

Create the following script (I've named it add_endnote_tags.awk):

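/^<h2>/ {
    # New chapter: reset the note counter and build a slug from the title.
    i = 0;
    chapter_name = $0;
    gsub(/<[^>]+>/, "", chapter_name);          # strip the tags
    chapter_name = tolower(chapter_name);
    gsub(/[^a-z0-9]+/, "-", chapter_name);      # keep digits so "Chapter 1" stays distinct
    gsub(/^-+|-+$/, "", chapter_name);          # drop stray leading/trailing dashes
    print;
}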

/^<p>/ {
    i = i + 1;
    printf("<a name=\"endnote-%s-%d\"></a>%s\n", chapter_name, i, $0);
}

$0 !~ /^<h2>/ && $0 !~ /^<p>/ {
    print;
}

And then use it to parse your file:

awk -f add_endnote_tags.awk < source_file.xml > dest_file.xml

Hope that helps. If you are on Windows, you might need to install awk first, either by installing Cygwin with its awk package or by downloading gawk for Windows.
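For comparison, here is a rough Python sketch of the same single-pass idea: remember the most recent heading and reuse it for every note that follows. Unlike the awk script, it takes the note number from the paragraph itself instead of counting lines, so unnumbered paragraphs are left alone. The function names are made up for illustration, and it assumes each `<h2>` and `<p>` starts on its own line, as in the sample.

```python
import re

def slugify(title):
    """Lowercase the title and collapse each run of non-alphanumerics to one dash."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def add_endnote_anchors(xhtml):
    """Prefix each numbered <p> with an anchor named after the current <h2>."""
    slug = ""
    out = []
    for line in xhtml.splitlines():
        heading = re.match(r"\s*<h2>(.*?)</h2>", line)
        note = re.match(r"\s*<p>\s*(\d+)\b", line)
        if heading:
            # Remember the chapter; it applies to every note until the next <h2>.
            slug = slugify(re.sub(r"<[^>]+>", "", heading.group(1)))
        elif note and slug:
            # Use the number written in the paragraph, not a running counter.
            line = '<a name="endnote-%s-%s"></a>%s' % (slug, note.group(1), line)
        out.append(line)
    return "\n".join(out)

if __name__ == "__main__":
    sample = "<h2>Chapter 1: The Beginning</h2>\n<p> 1 Endnote entry number one.</p>"
    print(add_endnote_anchors(sample))
```

Because the chapter slug lives in an ordinary variable rather than in the regex, the "each match starts after the previous one" limitation from the question never comes up.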

beny23

I think this would be difficult to accomplish in a text editor, as it requires a two-step process: first you need to split the file into chapters, then you need to process the contents of each chapter. Assuming that "endnote paragraphs" (which is where you wish to add the anchors) are defined as paragraphs whose first word is an integer, this PHP script will do what you need.

<?php
$data = file_get_contents('testdata.txt');
$output = processBook($data);
file_put_contents('testdata_out.txt', $output);
echo $output;

// Main function to process book adding endnote anchors.
function processBook($text) {
    $re_chap = '%
        # Regex 1: Get Chapter.
        <h2>([^<>]+)</h2>  # $1: Chapter title.
        (                  # $2: Chapter contents.
          .+?              # Contents are everything up to
          (?=<h2>|$)       # next chapter or end of file.
        )                  # End $2: Chapter contents.
        %six';
    // Match and process each chapter using callback function.
    $text = preg_replace_callback($re_chap, '_cb_chap', $text);
    return $text;
}
// Callback function to process each chapter.
function _cb_chap($matches) {
    // Build ID from H2 title contents.
    // Trim leading and trailing ws from title.
    $baseid = trim($matches[1]);
    // Strip everything except spaces and alphanumerics
    // (operate on the trimmed $baseid, not the raw title).
    $baseid = preg_replace('/[^ A-Za-z0-9]/', '', $baseid);
    // Append prefix and convert runs of spaces to a single dash.
    $baseid = 'endnote-'. preg_replace('/ +/', '-', $baseid);
    // Convert to lowercase.
    $baseid = strtolower($baseid);
    $text = preg_replace(
                '/(<p>\s*)(\d+)\b/',
                '<a name="'. $baseid .'-$2"></a>$1$2',
                $matches[2]);
    return '<h2>'. $matches[1] .'</h2>'. $text;

}
?>

This script correctly processes your example data.
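The question also mentions the other half of the job: linking each reference in the book text to its anchor. The same two-step structure (split into chapters, then process each chapter) should carry over. Below is a minimal sketch in Python rather than PHP. It assumes, purely for illustration, that references in the body are marked up as `<sup>N</sup>` and that the body uses the same `<h2>` chapter titles as the endnotes file; neither is stated in the question, and `link_note_refs` is a made-up name.

```python
import re

def slugify(title):
    """Lowercase the title and collapse each run of non-alphanumerics to one dash."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def link_note_refs(body, notes_file="endnotes.xhtml"):
    """Wrap each <sup>N</sup> in a link to the anchor for the current chapter."""
    # Step 1: split on headings. The capture group keeps each title in the
    # result, so parts is: text before the first heading, title, contents, ...
    parts = re.split(r"<h2>([^<>]+)</h2>", body)
    out = [parts[0]]
    for title, contents in zip(parts[1::2], parts[2::2]):
        # Step 2: rewrite the references inside this chapter only.
        target = "%s#endnote-%s-" % (notes_file, slugify(title))
        contents = re.sub(
            r"<sup>(\d+)</sup>",  # assumed reference markup
            lambda m: '<a href="%s%s">%s</a>' % (target, m.group(1), m.group(0)),
            contents)
        out.append("<h2>%s</h2>%s" % (title, contents))
    return "".join(out)

if __name__ == "__main__":
    body = "<h2>Introduction</h2>\n<p>Some claim.<sup>1</sup></p>"
    print(link_note_refs(body))
```

The slug rule has to match whatever generated the anchors, so if you change one, change the other.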

ridgerunner
  • This is kind of what I'm assuming I'll end up doing, thanks for the detailed answer! I'll reserve answer status for a bit to see if anyone has a straight regex solution... – Yorb Aug 04 '11 at 00:49