0

I'm trying to select the lines between two markers in an HTML file. I've tried sed and awk, but I think the problem is the way I'm escaping some of the characters. I have seen similar questions and answers, but their examples are simple, with no special characters. I need the lines between

<div class="bread crumb">

and

</div>

There is no other div within the block and there are multiple lines within the block.

Do I need to escape the characters <, > and ? as below?

sed -n -e '/^\<div class=\"bread crumb\"\>$/,/^\<\/div\>$/{ /^\<div class=\"bread crumb\">$/d; /^\<\/div>$/d; p; }'

My awk attempt:

awk '/\<div class=\"bread crumb\"\>/{flag=1;next}/\<\/div\>/{flag=0}flag'
Aserre
hareti
  • While `sed` and `awk` might be able to do the job for some input, it is considered bad practice to use non-HTML aware tools to parse HTML. You should have a look at `xpath`, which is a dedicated tool to parse HTML/XML files – Aserre Jul 21 '17 at 11:45
  • post the initial html fragment and expected result – RomanPerekhrest Jul 21 '17 at 12:04
  • obligatory [don't parse html with regex](https://stackoverflow.com/a/1732454/7552) link – glenn jackman Jul 21 '17 at 14:27

3 Answers

1

You should use an HTML parser for that job.

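If you still want to do it with sed, don't escape < and >. They are ordinary characters in a sed regex, whereas the escaped forms \< and \> are word-boundary anchors in GNU sed, so escaping them changes what the pattern matches.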
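A small sketch of the difference, assuming GNU sed (other sed implementations may treat \< and \> differently). The sample line is taken from the question; the variable names are only for illustration.

```shell
# The opening marker from the question, as it would appear on its own line.
line='<div class="bread crumb">'

# Unescaped angle brackets are ordinary characters and match literally.
plain=$(printf '%s\n' "$line" | sed -n '/^<div class="bread crumb">$/p')

# Escaped, \< and \> are word-boundary anchors in GNU sed, so this pattern
# no longer asks for a literal "<" at the start of the line.
escaped=$(printf '%s\n' "$line" | sed -n '/^\<div class="bread crumb"\>$/p')

printf 'plain:   [%s]\n' "$plain"
printf 'escaped: [%s]\n' "$escaped"
```

With GNU sed the first command should print the line and the second should print nothing, which is consistent with the original attempt selecting no lines.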

Try this:

sed -ne '/<div class="bread crumb">/,/<\/div>/{//!p;}' file

An empty regex (//) reuses the most recently used regular expression, so //!p prints every line of the block except the lines matching the address patterns.
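To see the command in action, here is a minimal sketch run against a made-up input. The sample content and the temporary file are assumptions, since the question did not include the actual HTML.

```shell
# Hypothetical sample input (not from the question).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<p>before</p>
<div class="bread crumb">
<a href="/">Home</a>
<a href="/docs">Docs</a>
</div>
<p>after</p>
EOF

# Select the range, then skip the lines that match either address pattern.
result=$(sed -ne '/<div class="bread crumb">/,/<\/div>/{//!p;}' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

For this input only the two <a ...> lines should be printed; the opening and closing div lines are dropped.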

SLePort
1

Actually, you only need to escape the / in </div>; the rest is fine as it is.

sed -n '/<div class="bread crumb">/,/<\/div>/{//!p}' 
Guru
0

Just use string matches in awk:

awk '$0=="</div>"{f=0} f{print} $0=="<div class=\"bread crumb\">"{f=1} ' file
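A quick sketch of this approach on a made-up input (the sample content and temporary file are assumptions). Because the comparison is an exact string match on the whole line, it avoids regex escaping entirely, but it only works when each marker sits alone on its line with no leading or trailing whitespace.

```shell
# Hypothetical sample input (not from the question).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<p>before</p>
<div class="bread crumb">
<a href="/">Home</a>
<a href="/docs">Docs</a>
</div>
<p>after</p>
EOF

# Clear the flag on the closing marker, print while the flag is set,
# and set the flag only after the opening marker has been passed over.
result=$(awk '$0=="</div>"{f=0} f{print} $0=="<div class=\"bread crumb\">"{f=1}' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

The order of the three rules is what keeps both marker lines out of the output: the closing check runs before the print, and the opening check runs after it.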
Ed Morton