I'm stumped. I have an HTML file that I'm trying to convert to plain text and I'm using sed to clean it up. I understand that sed works on the 'stream' and works one line at a time, but there are ways to match multiline patterns.
Here is the relevant section of my source file:

<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>&nbsp;
<span class="region">Region</span>&nbsp;&nbsp;
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>

I would like this to be made into the following plaintext format:

My Name

123 street
City Region  1A1 A1A
my@email.ca
000-000-0000

The key is that City, Region, and Post code are all on one line now.
I use sed -f commands.sed file.html > output.txt and I believe that the following sed program (commands.sed) should put it in that format:

#using the '@' symbol as delimiter instead of '/'
#remove tags
s@<.*>\(.*\)</.*>@\1@g
#remove the nbsp
s@\(&nbsp;\)*@@g
#add a newline before the address (actually typing a newline in the file)
s@\(123 street\)@\
\1@g
#and now the command that matches multiline patterns
#find 'City',read in the next two lines, and separate them with spaces
/City/ {
N
N
s@\(.*\)\n\(.*\)\n\(.*\)@\1 \2  \3@g
}

Seems to make sense. Tags are all stripped and then three lines are put into one.
Buuuuut it doesn't work that way. Here is the result I get:

My Name

123 street
City <span class="region">Region</span>&nbsp;&nbsp;  <span class="postal-code">1A1 A1A</span>
my@email.ca
000-000-0000

To my (relatively inexperienced) eyes, it looks like sed is 'forgetting' the changes it made (stripping off the tags). How would I solve this? Is the solution to write the file after three commands and re-run sed for the fourth? Am I misusing sed? Am I misunderstanding the 'stream' part?

I'm running Mac OS X 10.4.11 with the bash shell and using the version of sed that comes with it.

  • You might have better luck using `awk`, since that has real variables which you can populate as you process the file and then write out at the end. – Lily Ballard Oct 10 '11 at 23:58

3 Answers

I think you're confused. Sed operates line-by-line, and runs all commands on the line before moving to the next. You seem to be assuming it strips the tags on all lines, then goes back and runs the rest of the commands on the stripped lines. That's simply not the case.
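To make that concrete, here is a minimal sketch of one possible fix, assuming the input looks exactly like the sample in the question. It joins the three address lines first and strips the markup afterwards, so the `s` commands see the whole joined pattern space instead of only the line that was current when they ran. The function name `html_to_text` and the choice to key on `class="locality"` are illustrative assumptions, not something from the original script, and the result has single spaces between City, Region, and the postal code.

```shell
# Hypothetical helper: join first, strip afterwards.
# Each -e is treated as its own script line, which sidesteps differences
# in how various seds parse a closing brace on the same line as a command.
html_to_text() {
  sed -e '1G' \
      -e '/class="locality"/{' \
      -e 'N' \
      -e 'N' \
      -e 's/\n/ /g' \
      -e '}' \
      -e 's/<[^>]*>//g' \
      -e 's/&nbsp;//g' "$@"
}
# 1G            appends the (empty) hold space to line 1, giving a blank line after the name
# /locality/{}  pulls the next two raw lines into the pattern space and joins them with spaces
# the last two  commands then strip tags and &nbsp; entities from whatever is in the pattern space
```

It could be used as `html_to_text file.html > output.txt`. Because the tag stripping comes last, it also applies to the lines that `N` pulled in.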

Lily Ballard
  • I probably am (since I'm still learning). That's most likely my mistake as that's _exactly_ what I'm assuming. I'll have to rethink my script then. – Jason Margolis Oct 11 '11 at 00:07

See RegEx match open tags except XHTML self-contained tags ... and stop using sed for this.

Sed is a wonderful tool, but not for processing HTML. I suggest using Python and BeautifulSoup, which is basically built just for this sort of task.

John Zwinck
  • That is very interesting. Unfortunately I have zero experience with Python. I was trying with sed since I know a bit of it and I really needed a quick and dirty solution. I am going to look into this BeautifulSoup because, as you say, it's tailor made for this. – Jason Margolis Oct 11 '11 at 00:12

If you have only one data block per HTML file, try the following (using sed):

kent$  cat t
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>&nbsp;
<span class="region">Region</span>&nbsp;&nbsp;
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>

kent$  sed 's/<[^>]*>//g; s/&nbsp;//g' t |sed '1G;3{N;N; s/\n/ /g}'
My Name

123 street
City Region 1A1 A1A
my@email.ca
000-000-0000
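
The one-liner above puts `}` directly after the `s` command, which GNU sed accepts. As far as I know, some BSD-derived seds (such as the one shipped with Mac OS X) expect a `;` or a newline before the closing brace, so here is a hedged, more portable sketch of the same two-pass idea. The function name `two_pass` is hypothetical, and the line number 3 assumes the same layout as the sample file.

```shell
# Hypothetical helper: pass 1 cleans every line, pass 2 reshapes the clean text.
two_pass() {
  # Pass 1: strip tags and &nbsp; entities on each line independently.
  sed -e 's/<[^>]*>//g' -e 's/&nbsp;//g' "$@" |
  # Pass 2: 1G adds a blank line after line 1; on line 3, N twice appends
  # the next two (already clean) lines, then the newlines become spaces.
  sed -e '1G' \
      -e '3{' \
      -e 'N' \
      -e 'N' \
      -e 's/\n/ /g' \
      -e '}'
}
```

Usage would be `two_pass t`. The second sed only ever sees text that the first one has already cleaned, which is why the join works here.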
Kent