I have the following text layout:

Heading
Chapter 1:1 This is text
2 This is more text
3 This is more text
4 This is more text
5 This is more text
6 This is more text
7 This is more text
8 This is more text
9 This is more text
10 This is more text
11 This is more text
12 This is more text
13 This is more text
14 This is moret text 
15 This is more text
Heading    
Chapter 2:1 This is text
2 This is more text...

I am trying to add, in parentheses right after each Heading, the first Chapter reference and the number of the last line in that Chapter, like so:

Heading (Chapter 1:1-15)
Chapter 1:1 This is text
2 This is more text
3 This is more text
4 This is more text
5 This is more text
6 This is more text
7 This is more text
8 This is more text
9 This is more text
10 This is more text
11 This is more text
12 This is more text
13 This is more text
14 This is moret text 
15 This is more text

I've come up with this regular expression so far:

~s/(?s)(Heading)\r(^\d*\w+\s*\d+:\d+|\d+:\d+)(.*?)(\d+)(.*?\r)(?=Heading)/\1 (\2-\4)\r\2\3\4\5/g;

However, this grabs the first number after Chapter 1:1 (i.e. "2", giving "Heading (Chapter 1:1-2)") instead of the last one ("15", as in "Heading (Chapter 1:1-15)"). Could someone please tell me what's wrong with the regex? Thank you!

RGP
  • Rather than whacking a problem with a mega regex -- one you won't be able to understand and maintain over the long term -- you're usually better off breaking the logic down into small, easily understood steps: (a) read in a section; (b) collect the needed info; (c) modify the first line; (d) print the section; (e) repeat. – FMc Aug 24 '11 at 13:41
  • Actually, I made a mistake. Heading 1 and Heading 2 always have the same name (i.e., Heading), which makes the regex more difficult. If they had different names, making the (.*?) greedy (.*) would do the trick. – RGP Aug 24 '11 at 18:41
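
The lazy-versus-greedy behaviour mentioned in the comment above can be seen on a single section. This is only a minimal sketch: the variable names and the shortened three-line sample are assumptions made for illustration, and the patterns are simplified stand-ins rather than the regex from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical one-section sample; lines 2..4 stand in for 2..15.
my $section = "Chapter 1:1 This is text\n"
            . "2 This is more text\n"
            . "3 This is more text\n"
            . "4 This is more text\n";

# Lazy: .*? expands only as far as it must, so (\d+) stops at the
# first numbered line it reaches.
my ($lazy) = $section =~ /^Chapter \d+:\d+(?s:.*?)\n(\d+) /;

# Greedy: .* runs to the end and backtracks, so (\d+) lands on the
# last numbered line.
my ($greedy) = $section =~ /^Chapter \d+:\d+(?s:.*)\n(\d+) /;

print "lazy=$lazy greedy=$greedy\n";
```

With several sections in one string and identical Heading lines, the greedy form would run on to the last section, which is the difficulty the comment describes.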

2 Answers

An implementation of @FMc's comment could be something like:

#!/usr/bin/perl
use warnings;
use strict;

my $buffer = '';
while (<DATA>) {
    if (/^Heading \d+/) { # process previous buffer, and start new buffer
        process_buffer($buffer);
        $buffer = $_;
    }
    else { # add to buffer
        $buffer .= $_;
    }
}
process_buffer($buffer);   # don't forget last buffer's worth...


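sub process_buffer {
    my ($section) = @_;   # not named $b, to avoid shadowing sort's special $b

    return unless length $section;  # don't bother with an unpopulated buffer

    # Last line number: digits, one whitespace character, then text running
    # to the end of the buffer. Only the final line can satisfy this,
    # because . does not match a newline.
    my ($last) = $section =~ /(\d+)\s.*$/;
    # First "Chapter N:N" reference found at the start of a line.
    my ($chap) = $section =~ /^(Chapter \d+:\d+)/m;
    # Append "(Chapter N:N-last)" to the heading line.
    $section =~ s/^(Heading \d+)/$1 ($chap-$last)/;

    print $section;
}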

__DATA__
Heading 1
Chapter 1:1 This is text
2 This is more text
3 This is more text
4 This is more text
5 This is more text
6 This is more text
7 This is more text
8 This is more text
9 This is more text
10 This is more text
11 This is more text
12 This is more text
13 This is more text
14 This is moret text
15 This is more text
Heading 2
Chapter 2:1 This is text
2 This is more text...
3 This is more text
tadmc
  • This is excellent, @tadmc. It works fine, although all the stuff about buffer is a bit over my head... Thanks a bunch! – RGP Aug 24 '11 at 18:38

Edit for updated question

Here's a regex with explanation that will solve your problem. http://codepad.org/mSIYCw4R

~s/
((?:^|\n)Heading)   #Capture Heading (and any preceding newline) into group 1.
                    #We can't use a lookbehind because of (?:^|\n)
(?=                 #Lookahead: match what follows without consuming it.
  \nChapter\s       #Find the Chapter text.
  (\d+:\d+)         #Capture the first chapter reference into group 2.
  .*                #Match the rest of the Chapter line.
  (?:\n(\d+).+)+    #Match every following numbered line.
                    #Group 3 keeps the number from the last repetition.
)
/$1 (Chapter $2-$3)/gx;
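
For readers who cannot reach the codepad link, here is a minimal sketch of how the substitution above might be applied. It assumes the whole text sits in one scalar, here given the hypothetical name `$text`, that lines end in `\n`, and it uses a shortened sample with three lines in the first section.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $text is a hypothetical variable holding the whole document.
my $text = <<'END';
Heading
Chapter 1:1 This is text
2 This is more text
3 This is more text
Heading
Chapter 2:1 This is text
2 This is more text...
END

$text =~ s/
    ((?:^|\n)Heading)       # group 1: a Heading line
    (?=
        \nChapter\s
        (\d+:\d+)           # group 2: first chapter reference
        .*
        (?:\n(\d+).+)+      # group 3: number on the last numbered line
    )
/$1 (Chapter $2-$3)/gx;

print $text;
```

On this sample the two Heading lines should come out as `Heading (Chapter 1:1-3)` and `Heading (Chapter 2:1-2)`. If a Heading line carries trailing whitespace, as one line in the question's sample appears to, the lookahead would need to allow for it; this sketch does not handle that case.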
Jacob Eggers
  • +1 Even though I advised against a regex, this is well done. Also, didn't know about codepad.org -- thanks for the tip. FWIW, it's recommended these days to use `$1` in the replacement rather than `\1`. For example, see http://stackoverflow.com/questions/3068236, or run your code with `use warnings` in effect. – FMc Aug 25 '11 at 00:40
  • @FMc Thanks for the tip. I had no idea there was a preference, so many examples use the `\1`. There are many times that I wish the rest of the web were like stackoverflow, so I could fix bad recommendations. Someone should make a stackoverflow browser. – Jacob Eggers Aug 25 '11 at 01:07
  • @Jacob Eggers. Thank you for an excellent solution and even better explanation. Regexes rule! – RGP Aug 25 '11 at 06:40