I'm stumped. I have an HTML file that I'm trying to convert to plain text, and I'm using sed to clean it up. I understand that sed works on a stream, one line at a time, but that there are ways to match multiline patterns.
Here is the relevant section of my source file:
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>
I would like this to be made into the following plaintext format:
My Name
123 street
City Region 1A1 A1A
my@email.ca
000-000-0000
The key is that City, Region, and postal code are all on one line now.
I run sed -f commands.sed file.html > output.txt, and I believe that the following sed program (commands.sed) should put it in that format:
#using the '@' symbol as delimiter instead of '/'
#remove tags
s@<.*>\(.*\)</.*>@\1@g
#remove the nbsp
s@\(&nbsp;\)*@@g
#add a newline before the address (actually typing a newline in the file)
s@\(123 street\)@\
\1@g
#and now the command that matches multiline patterns
#find 'City',read in the next two lines, and separate them with spaces
/City/ {
N
N
s@\(.*\)\n\(.*\)\n\(.*\)@\1 \2 \3@g
}
Seems to make sense. Tags are all stripped and then three lines are put into one.
Buuuuut it doesn't work that way. Here is the result I get:
My Name
123 street
City <span class="region">Region</span> <span class="postal-code">1A1 A1A</span>
my@email.ca
000-000-0000
To my (relatively inexperienced) eyes, it looks like sed is 'forgetting' the changes it made (stripping off the tags). How would I solve this? Is the solution to write the file after three commands and re-run sed for the fourth? Am I misusing sed? Am I misunderstanding the 'stream' part?
I'm running Mac OS X 10.4.11 with the bash shell, using the version of sed that comes with it.
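For comparison, here is a minimal sketch of one possible reordering. It rests on the fact that sed runs the whole script against one line before reading the next, so lines pulled in by N have never passed through the commands that appear earlier in the script. The sketch joins the three lines first and strips tags last. The file names sample.html and reordered.sed are hypothetical, matching on class="locality" instead of the literal text City is my own choice, and the tag-stripping pattern <[^>]*> is swapped in for the original one so that it never spans more than one tag.

```shell
# Hypothetical input file holding the sample HTML from above.
cat > sample.html <<'EOF'
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>
EOF

# Hypothetical reordered script: join first, strip tags afterwards.
cat > reordered.sed <<'EOF'
# on the locality line, append the next two input lines to the pattern space,
# then turn the two embedded newlines into spaces
/class="locality"/ {
N
N
s@\n@ @g
}
# strip tags last, so the lines appended by N are covered as well
s@<[^>]*>@@g
EOF

sed -f reordered.sed sample.html
```

Run against the sample, this should print the five-line format shown earlier, with City, Region, and the postal code on one line. It only illustrates the ordering issue; it does not handle the nbsp removal or the extra newline before the address.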