sed how to read in and process file of unknown length

Question

I want to insert into an HTML source file another text file, marked up as HTML. The text file is of unknown length, but always at least two lines. I was going to use m4, but "include" reads the whole file AFAIK. So, on to sed...

Once I have found the pattern that marks the insertion point, the first line of the text file should be wrapped in <div class=...> tags, the second similarly (but with a different class), and so on in a loop until EOF; then the rest of the source file is output.

Finding the insertion point is ok, as is printing the remainder of the source file. I am having a problem with sed looping to read in the text file until it is done.

Example input

title1
author1
title2
author2
...
titleN
authorN

Desired output

<!-- above here is source file, below is sed'ed output -->
<div class="title">
title1
</div>
<div class="author">
author1
</div>
<div class="title">
title2
</div>
<div class="author">
author2
</div>
...
<div class="title">
titleN
</div>
<div class="author">
authorN
</div>
<!-- below is rest of source file -->

I am not too concerned with line breaks; all on one line is fine. The example is just to make it clear what is going on.

I can get it to work fine with a \ <div .... and R filename and so on with the simple case of two or four lines of input. As soon as I try to use a loop to handle the case of a variable number of lines of input, I fail.

I tried using a dummy substitution s|^$.+$|\1| so I can test it with T and exit if the pattern match was empty, but it doesn't work. My other attempt resulted in sed going into an infinite loop.

How can you test whether R succeeded or failed? Is there a design pattern I am missing here?

(I'm using GNU sed, so R and T are ok.)

Thanks.

Just a heads up since there's HTML & regex involved here: [parsing HTML with regex is not a wise idea](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — Andrew Marshall, Mar 06 '12 at 04:39
How do you recognize which line is title, which line is author? Or are all odd lines titles, and all even lines authors? — ghoti, Mar 06 '12 at 04:43
@AndrewMarshall Thanks. The link was interesting reading. I'm actually just testing for a non-empty line rather than any tags, but point taken. — Nick Coleman, Mar 06 '12 at 05:53
@ghoti The file is defined as (title\nauthor\n){1,}, so, within each couplet, the first line is the title and the second is the author. — Nick Coleman, Mar 06 '12 at 05:56

ghoti · Answer 1 · 2012-03-06T05:00:45.533

Don't think of sed only as a language that loops through lines. You can also restrict a command to a range of lines by giving one regex that matches the first line of the range and another that matches the last:

sed '/firstRE/,/secondRE/s/ThingsBetweenLines/ReplaceWithThis/'

For example:

[ghoti@pc ~]$ printf 'one\ntwo\nthree\nfour\nfive\n' | sed '/two/,/four/s/[ore]/_/g'
one
tw_
th___
f_u_
five
[ghoti@pc ~]$

The catch is that sed isn't really good at inserting whole lines, and standard sed doesn't really have a way of saying "the current line number is even/odd". Multiline stuff is arcane and ugly. GNU sed does, if I recall, have some multi-line notation, but it's late at night and I can never remember how to use the non-standard stuff.
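For the record, GNU sed does offer a `first~step` address form that can pick out odd and even lines. The following is only a sketch under that assumption (it is a GNU extension, not POSIX), and `mark_couplets` is a hypothetical helper name. It formats the list but does nothing about inserting it into the source file.

```shell
# Assumes GNU sed: first~step addresses are a GNU extension.
# mark_couplets (hypothetical name) reads the title/author list on stdin.
mark_couplets() {
  # 1~2 selects lines 1, 3, 5, ... (titles); 2~2 selects 2, 4, 6, ... (authors).
  sed -e '1~2s|.*|<div class="title">&</div>|' \
      -e '2~2s|.*|<div class="author">&</div>|'
}

printf 'title1\nauthor1\ntitle2\nauthor2\n' | mark_couplets
```

Each line is wrapped on its own, so this avoids multi-line pattern space entirely.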

So I recommend awk. :) Its code is easier to read, and it's better suited to this sort of task.

awk '
  BEGIN {
    fmt="<div class=\"title\">%s</div>\n<div class=\"author\">%s</div>\n";
  }
  {
    title=$0;
    # Read the matching author line; stop if the input ends mid-couplet.
    if ((getline author) <= 0) exit;
    printf(fmt, title, author);
  }
'

Of course, you can also do this in pure shell:

#!/bin/sh

fmt="<div class=\"title\">%s</div>\n<div class=\"author\">%s</div>\n"

while IFS= read -r line; do
  if [ -z "$title" ]; then
    title="$line"
    continue
  fi
  author="$line"
  printf "$fmt" "$title" "$author"
  title=''
done

See, it works for me:

[ghoti@pc ~/tmp]$ printf 'title1\nauthor1\ntitle2\nauthor2\n' | ./doit
<div class="title">title1</div>
<div class="author">author1</div>
<div class="title">title2</div>
<div class="author">author2</div>
[ghoti@pc ~/tmp]$ printf 'title1\nauthor1\ntitle2\nauthor2\n' | ./doit.awk
<div class="title">title1</div>
<div class="author">author1</div>
<div class="title">title2</div>
<div class="author">author2</div>
[ghoti@pc ~/tmp]$
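The scripts above only format the list. To also place the result at the insertion point, one possible approach is to let awk read both files. This is a minimal sketch, not a tested production script: `insert_couplets` and its argument order are hypothetical, the marker is matched as a literal substring, and the list file is assumed to be non-empty (as the question states).

```shell
# insert_couplets MARKER LIST SOURCE   (hypothetical helper)
# Copies SOURCE to stdout and emits the formatted couplets from LIST
# right after every line that contains MARKER.
insert_couplets() {
  awk -v marker="$1" '
    NR == FNR {              # first file: remember the title/author lines
      lines[++n] = $0
      next
    }
    { print }                # second file: copy each source line through
    index($0, marker) {      # literal substring match on the marker
      for (i = 1; i + 1 <= n; i += 2)
        printf "<div class=\"title\">%s</div>\n<div class=\"author\">%s</div>\n", lines[i], lines[i + 1]
    }
  ' "$2" "$3"
}
```

Usage would look like `insert_couplets 'insertion point' title.authors source.html > output.html`, where the file names are placeholders.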

Thanks. You've convinced me to use awk, which works fine. I was sure I was missing something because I could not see the point of being able to read line-by-line without knowing when EOF occurs. I still can't see the point, but, hey, I guess the GNU folk found a use for it. — Nick Coleman, Mar 16 '12 at 11:00

potong · Answer 2 · 2012-03-06T19:31:00.723


This might work for you (GNU sed):

cat <<! >couplet.sed
N;s/\(.*\)\n\(.*\)/<div class="title">\1<\/div><div class="author">\2<\/div>/
!
sed '/^<!-- below is rest of source file -->/e sed -f couplet.sed data' source
<!-- above here is source file, below is sed'ed output -->
<div class="title">title1</div><div class="author">author1</div>
<div class="title">title2</div><div class="author">author2</div>
...
<div class="title">titleN</div><div class="author">authorN</div>
<!-- below is rest of source file -->

What is needed is a sed program within a sed command. This is achieved using the e command.

N.B. The sed program can be replaced with any bash command/script/etc.

Explanation:

Create a sed script which reads the data file two lines at a time and wraps each pair in the desired div elements.
Read the source file up to the insertion point and then run the above script. The e command inserts the output of running couplet.sed against the data file into the output of the sed one-liner.

The e command can be run in three ways:

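As a flag of the s command, s/PATTERN/COMMAND/e: after the substitution, the pattern space is executed as a command and replaced by its output.
As a stand-alone command followed by a shell command, whose output is inserted into the output stream, e.g. 1e date
Without parameters, it executes whatever is in the pattern space and replaces the pattern space with the output.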
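The three forms can be tried in isolation. This is a small sketch assuming GNU sed (the e command is not available in other seds); the variable names are only for illustration.

```shell
# Assumes GNU sed; e is a GNU extension.

# 1. Flag on s: the pattern space becomes "echo hello", is executed,
#    and is replaced by the command's output.
flag_form=$(printf 'hello\n' | sed 's/^/echo /e')

# 2. e with a command: the command's output is written to the output
#    stream before line 2 itself is printed.
command_form=$(printf 'a\nb\n' | sed '2e echo inserted')

# 3. Bare e: the pattern space itself is run as the command.
bare_form=$(printf 'echo from-pattern-space\n' | sed 'e')

printf '%s\n' "$flag_form" "$command_form" "$bare_form"
```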

An alternative sed solution:

sed -e 'N;s/\(.*\)\n\(.*\)/\/^<!-- below is rest of source file -->\/i\\<div class="title">\1<\/div><div class="author">\2<\/div>/' data |
sed -f - source

edited Mar 06 '12 at 19:31

answered Mar 06 '12 at 10:44

potong


Your `couplet.sed` script is both well-named and effective - well done. The rest of the script is inscrutable, and I don't see what you are expecting it to do. I'd like to up-vote, but I can't yet. – Jonathan Leffler Mar 06 '12 at 14:18
@JonathanLeffler I've added an explanation to the solution. – potong Mar 06 '12 at 14:50
Oh, I think I see...the `e` command in GNU `sed` means 'execute the following as a shell script with its standard output going to the output of the main `sed` script, and its standard input coming from `/dev/null`, or thereabouts'. Ick, and likewise Yuck. But if that's supported by GNU `sed`, I suppose it gives it some legitimacy. It won't work with any normal `sed`, and I'm not at all sure I like it...but maybe I'm just too old fashioned. – Jonathan Leffler Mar 06 '12 at 15:09
@JonathanLeffler the `e` command was introduced into GNU sed in version 3.95 around about 2002. – potong Mar 06 '12 at 15:50
It still isn't in POSIX, and I work with POSIX more than tuning my experience to GNU, precisely so I don't run into problems on systems that do not use GNU `sed` by default (such as Mac OS X, Solaris, HP-UX or AIX - that's 4 of the 5 Unix-like platforms that I work on that do not support the 'e' command in `sed`). You're not obliged to be constrained by the constraints I work under; but it is as well to be aware of when you are using a GNU extension to the standard. – Jonathan Leffler Mar 06 '12 at 17:38
@JonathanLeffler point taken. I've added a non GNU sed solution. – potong Mar 06 '12 at 19:33

Jonathan Leffler · Answer 3 · 2012-03-06T14:29:22.933

You have two input files. One consists of:

some text
insertion point pattern
rest of the text

plus the list of alternating title and author lines in a second file.

And the output should be:

some text
insertion point pattern
...alternating list of title and author <div>s
rest of the text

I think the easiest way to deal with this is:

Process the title/author list (from the title.authors file) into a temporary file.
Have sed read the temporary file at the insertion point.

This translates to the outline:

tmp=${TMPDIR:-/tmp}/at.$$     # Or use mktemp command
trap "rm -f $tmp; exit 1" 0 1 2 3 13 15

sed -e 'N' \
    -e 's%\(.*\)\n\(.*\)%<div class="title">\1</div>\n<div class="author">\2</div>%' \
    title.authors > $tmp

sed "/insertion point pattern/r $tmp" main-file > output-file

rm -f $tmp
trap 0

The details with the trap commands ensure that the script cleans up after itself if it is sent a HUP, INT, QUIT, PIPE or TERM signal.

The first sed script uses N to combine adjacent lines, so the pattern space holds the title and the author separated by a newline. The second expression then captures the material on either side of the newline into \1 and \2, which are then tagged up.

The second sed script identifies the insertion point, prints that line, and then copies the preprocessed file of titles and authors to the output immediately before reading the next line (note the double quotes, which allow the shell to expand $tmp).

It is a mild nuisance to need the temporary file, but doing so cleanly separates the different duties of 'formatting the title and author information' and 'copying the formatted title and author information to the correct place in the data stream'.
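If GNU sed is available, the temporary file can be avoided by piping the formatted list into the second sed and reading it with r /dev/stdin. This is a sketch under that assumption, not a portable solution: `insert_formatted` is a hypothetical name, the pattern must not contain a slash, and only the first matching line receives the list because standard input can be read just once.

```shell
# insert_formatted PATTERN LIST MAIN   (hypothetical helper; assumes GNU sed)
insert_formatted() {
  # Stage 1: join each title/author pair and tag it up.
  sed -e 'N' \
      -e 's%\(.*\)\n\(.*\)%<div class="title">\1</div>\n<div class="author">\2</div>%' \
      "$2" |
    # Stage 2: copy MAIN, reading the piped list in after the matching line.
    sed "/$1/r /dev/stdin" "$3"
}
```

The two duties stay separate; only the hand-off changes from a file to a pipe.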

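If you need the marker HTML/XML comments in the output, you can complicate your pre-processing script with two extra expressions. Put the 1i expression before -e 'N' and the $a expression after it, because N advances the line number: after N the current line is no longer line 1, and only after N can the current line be the last one. The apostrophe in "sed'ed" has to be written as '\'' inside the single-quoted shell string:

   -e '1i\
      <!-- above here is source file, below is sed'\''ed output -->' \
   -e '$a\
      <!-- below is rest of source file -->'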

Be aware that the leading blanks will be included in the output. If that matters, put the whole of the first script into a file (title-author.sed) and use sed -f title-author.sed title.authors > $tmp to preprocess the information:

title-author.sed

1i\
<!-- above here is source file, below is sed'ed output -->
N
s%\(.*\)\n\(.*\)%<div class="title">\1</div>\n<div class="author">\2</div>%
$a\
<!-- below is rest of source file -->

The downside of this is the extra file - the sed script. You could generate it on the fly as another temp file, of course. My trick then is to use:

tmp=${TMPDIR:-/tmp}/at.$$
trap "rm -f $tmp.?; exit 1" 0 1 2 3 13 15

cat > $tmp.1 <<'EOF'
1i\
<!-- above here is source file, below is sed'ed output -->
N
s%\(.*\)\n\(.*\)%<div class="title">\1</div>\n<div class="author">\2</div>%
$a\
<!-- below is rest of source file -->
EOF

sed -f $tmp.1 title.authors > $tmp.2

sed "/insertion point pattern/r $tmp.2" main-file > output-file

rm -f $tmp.?
trap 0

The change is to use the generated temporary name as a prefix, and the actual temporary files are $tmp.1, $tmp.2. The clean-up is just marginally different, to reflect that there could be multiple temporary files to remove.

Clearly, you can arrange for the two input files to be parameters to the script, and simply leave the script writing to standard output so that you can redirect its output wherever you want, rather than forcing it to output-file. A general purpose script should, in fact, do that.
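A parameterized version might look like the sketch below. It is only an outline under a few assumptions: `insert_couplets_tmp` is a hypothetical name, mktemp is available, the pattern contains no slash, and sed accepts \n in the replacement (GNU sed does).

```shell
# insert_couplets_tmp PATTERN LIST MAIN   (hypothetical helper)
# The body runs in a subshell, so the traps do not leak into the caller.
insert_couplets_tmp() (
  tmp=$(mktemp "${TMPDIR:-/tmp}/at.XXXXXX") || exit 1
  # Clean up if interrupted by HUP, INT, QUIT, PIPE or TERM.
  trap 'rm -f "$tmp"; exit 1' 1 2 3 13 15

  # Format the title/author list into the temporary file.
  sed -e 'N' \
      -e 's%\(.*\)\n\(.*\)%<div class="title">\1</div>\n<div class="author">\2</div>%' \
      "$2" > "$tmp"

  # Copy MAIN to standard output, reading the formatted list in at PATTERN.
  sed "/$1/r $tmp" "$3"
  status=$?

  rm -f "$tmp"
  exit "$status"
)
```

It would be called as `insert_couplets_tmp 'insertion point pattern' title.authors main-file > output-file`, leaving the redirection to the caller.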

score 0 · Answer 4 · answered Aug 15 '16 at 00:12


That's not a job for sed, it's a job for awk:

awk 'NR==FNR{a[NR]=$0; next} {print} /<div class=/{print a[++c]}' file1.txt file2.html

answered Aug 15 '16 at 00:12

Ed Morton



I now know. I was young and naive back in those days, sort of 1 Corinthians 13:11. But since yesterday, 1 Corinthians 13:10 ;-) – Nick Coleman Aug 16 '16 at 01:41
