I have an HTML file containing a large number of <section>...</section> blocks, in the following format.

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<section>
<div>
<header><h2>This is a title (RfQVthHm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (vxzbXEGq)</h2></header>
More HTML codes...
</div>
</section>

</body>
</html>

I need to extract the second <section>...</section> content.

This is the expected output.

<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

I noticed that I can look for the string UaHaZWvm first, start 2 lines before it, and print until I encounter the next </section>.

OP's efforts (mentioned in comments): grep -o "hi.*bye" file

Can this be done with awk, sed or grep tools please?

E_net4
Lorraine1996
  • Kindly do add your efforts in form of code in your question, which is highly encouraged on SO, thank you. – RavinderSingh13 Mar 14 '21 at 13:48
  • @RavinderSingh13 Sorry, I didn't find a workable solution from a web query, so I'm asking here. I read the grep documentation earlier and found that you can use `grep -o "hi.*bye" files.html` to get the specified range of content, but that doesn't quite work. – Lorraine1996 Mar 14 '21 at 14:02
  • @Lorraine1996. You could use `awk` in paragraph mode and extract the section you want where `(UaHaZWvm)` appears. – Carlos Pascual Mar 14 '21 at 14:06
  • @Lorraine1996, Kindly do add your tried code in your question(to avoid close votes to your question), there is nothing wrong or right we all are here to learn, so please do add that shown code in your question as an efforts of yours, thank you. – RavinderSingh13 Mar 14 '21 at 14:08
  • @CarlosPascual Sorry, I'll check the awk documentation. Will update here if there is any progress. – Lorraine1996 Mar 14 '21 at 14:12
  • Your question isn't clear - are you trying to print the **2nd section** no matter what it contains or the section that **contains UaHaZWvm** no matter what order it appears in? – Ed Morton Mar 14 '21 at 14:31
  • @EdMorton I needed to extract the `section` where the `UaHaZWvm` string was located. Now I have solved the problem and added the answer. Thanks for your attention! – Lorraine1996 Mar 15 '21 at 12:21

6 Answers

Since you're working with HTML, it's much simpler and better to use a tool that's aware of the format, like xmllint or some other program that lets you use XPath expressions to extract part of the document:

$ xmllint --html --xpath '//section[2]' input.html 2>/dev/null
<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

(xmllint reports a lot of errors about tags it does not recognize; its HTML parser does not seem to fully support HTML5. That is why standard error is redirected to /dev/null above.)


Alternative using hxselect from the W3C's HTML-XML-utils collection of programs. Instead of XPath, it uses a CSS selector to specify what to fetch from the document:

hxselect 'section:nth-child(2)' < input.html
Shawn

With your shown samples, please try the following. Written and tested in GNU awk; it should work in any awk.

awk '
/^<\/section>/{
  if(found1==2 && found2==1){
    print val ORS $0
    exit
  }
  found2++
}
/<section>/{
  found1++
}
found1==2{
  val=(val?val ORS:"")$0
}
' Input_file

Explanation: a detailed explanation of the above.

awk '                             ##Start of the awk program.
/^<\/section>/{                   ##Check whether the line starts with </section>.
  if(found1==2 && found2==1){     ##If this closes the 2nd section (2 openings seen, 1 closing seen so far) then do the following.
    print val ORS $0              ##Print the collected lines followed by the current closing line.
    exit                          ##Exit the program, nothing more is needed.
  }
  found2++                        ##Otherwise increase found2 (count of closing tags) by 1.
}
/<section>/{                      ##Check whether the line contains <section>.
  found1++                        ##Increase found1 (count of opening tags) by 1.
}
found1==2{                        ##While inside the 2nd section do the following.
  val=(val?val ORS:"")$0          ##Append the current line to val, separated by ORS.
}
' Input_file                      ##Mentioning Input_file name here.
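
For comparison, here is a more compact sketch of the same counting idea with the section number passed in as a variable. The helper name nth_section and the temporary sample file are made up for illustration, and it assumes each <section> and </section> tag sits on its own line, as in the question's sample.

```shell
# Hypothetical sample file, mirroring the structure shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<section>
<div>
<header><h2>This is a title (RfQVthHm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (vxzbXEGq)</h2></header>
More HTML codes...
</div>
</section>

</body>
</html>
EOF

# nth_section N FILE: print the Nth <section>...</section> block of FILE.
# (Illustrative helper, not part of any tool.)
nth_section() {
  awk -v n="$1" '
    /<section>/                 { count++ }  # a new section starts here
    count == n                  { print }    # inside the wanted section
    count == n && /<\/section>/ { exit }     # stop right after its closing tag
  ' "$2"
}

nth_section 2 "$sample"
```

Because awk exits at the closing tag, the rest of a large file is never read.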
RavinderSingh13

With awk in paragraph mode:

awk -v RS= -v ORS='\n\n' '/UaHaZWvm/' file
<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>
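
Paragraph mode (an empty RS) makes awk treat every run of lines separated by blank lines as one record, so the whole section is matched at once. A minimal sketch of the same idea with the ID passed in as a variable follows; the helper name section_with_id and the temporary sample file are assumptions for illustration, and it only works while sections are separated by blank lines and contain none themselves.

```shell
# Hypothetical sample file, mirroring the structure shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<section>
<div>
<header><h2>This is a title (RfQVthHm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (vxzbXEGq)</h2></header>
More HTML codes...
</div>
</section>

</body>
</html>
EOF

# section_with_id ID FILE: print every blank-line-separated block containing ID.
# index() does a literal substring test, so characters such as "." or "("
# in the ID are not treated as regex operators.
section_with_id() {
  awk -v RS= -v ORS='\n\n' -v id="$1" 'index($0, id)' "$2"
}

section_with_id UaHaZWvm "$sample"
```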
Carlos Pascual
gawk '/<section>/,/<\/section>/{ s=s $0; }
      /<\/section>/{ i++; print i, s; s=""; }
      END{ if(s!="") print i,s}' some.html

Will print all sections, like:

1 <section><div><header><h2>This is a title (RfQVthHm)</h2></header>More HTML codes...</div></section>
2 <section><div><header><h2>This is a title (UaHaZWvm)</h2></header>More HTML codes...</div></section>
3 <section><div><header><h2>This is a title (vxzbXEGq)</h2></header>More HTML codes...</div></section>

This uses a range pattern (/start/,/end/); see the gawk or awk man page.

It should be easy to only return the second one...

EDIT: (based on the comments from Ed M.)

gawk '/<section>/{ i=(i<0?-i:i); i++; }
      { a[i]=a[i] $0 }
      /<\/section>/{ i=-i; }
      END{ print a[2] }' some.html

With grep you can do: grep 'UaHaZWvm' -B2 -A3 some.html (this relies on the ID sitting on the third line of a six-line section, as in the sample), which outputs:

<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>
Luuk

It's not clear from your question whether you're trying to print the 2nd section no matter what it contains, or the section that contains UaHaZWvm no matter where it appears, so here are both solutions:

To print the 2nd section:

$ awk -v RS= -v ORS='\n\n' 'NR==3' file
<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>
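
The test is NR==3 rather than NR==2 because, in paragraph mode, everything before the first blank line (the doctype through <body>) is itself record 1. A quick way to check the numbering is to list the first word of each record; the helper name list_records and the temporary sample file below are made up for illustration.

```shell
# Hypothetical sample file, mirroring the structure shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<section>
<div>
<header><h2>This is a title (RfQVthHm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (vxzbXEGq)</h2></header>
More HTML codes...
</div>
</section>

</body>
</html>
EOF

# list_records FILE: print each paragraph-mode record number with its first field.
list_records() {
  awk -v RS= '{ print NR ": " $1 }' "$1"
}

list_records "$sample"
# 1: <!DOCTYPE
# 2: <section>
# 3: <section>
# 4: <section>
# 5: </body>
```

If the real file has a different number of blank-line-separated blocks before the first section, the record number to test shifts accordingly.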

To print any section that contains UaHaZWvm:

$ awk -v RS= -v ORS='\n\n' '/UaHaZWvm/' file
<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>
Ed Morton

I have updated my solution; I hope it will be useful to others.

This solution combines grep with sed or awk: grep's -B option sets where the output begins (2 lines before the match), and the -A option outputs the lines that follow (10,000 lines is usually enough); sed or awk then cuts the output at the first closing </section>.

awk

cat test.html | grep 'UaHaZWvm' -B2 -A10000 | awk 'NR==1,/<\/section>/'

sed

cat test.html | grep 'UaHaZWvm' -B2 -A10000 | sed -n '1,/<\/section>/p'
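
For comparison, the same extraction can be sketched with sed alone, collecting each section in the hold space instead of relying on fixed -B/-A line counts. The helper name section_by_id_sed and the temporary sample file are assumptions for illustration; the ID is pasted into the sed script as a regex, so this sketch is only suitable for plain alphanumeric IDs, and it assumes the opening and closing tags are on separate lines.

```shell
# Hypothetical sample file, mirroring the structure shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<section>
<div>
<header><h2>This is a title (RfQVthHm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (vxzbXEGq)</h2></header>
More HTML codes...
</div>
</section>

</body>
</html>
EOF

# section_by_id_sed ID FILE: print each <section> block that contains ID.
# Within a section: the opening line starts a fresh hold space (h), later
# lines are appended to it (H), and at the closing tag the collected block
# is swapped into the pattern space (x) and printed only if it contains ID.
section_by_id_sed() {
  sed -n '
    /<section>/,/<\/section>/ {
      /<section>/ { h; d; }
      H
      /<\/section>/ { x; /'"$1"'/p; }
    }
  ' "$2"
}

section_by_id_sed UaHaZWvm "$sample"
```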
Lorraine1996
  • Don't do that - it has a [UUOC](http://porkmail.org/era/unix/award.html), 3 commands, hard-coded -2, and a guess that 10,000 is usually enough. Just use the concise, robust, efficient `awk -v RS= -v ORS='\n\n' '/UaHaZWvm/' file` - it will simply work. – Ed Morton Mar 15 '21 at 12:50