Find H1 Text Using Bash

Question

What regex will find the text "This is the title" inside these tags, using grep, sed, or awk?

Code Example:

<h1 class="round title">
  <a href="/somepage">This is the title</a>
</h1>

I've tried this on the h1 tag above:

curl --silent http://domain.com/index.html | grep "<h1 class=\"round title\">"

Result is:

<h1 class="round title"><a href="/somepage">This is the title</a></h1>

and I only need the "This is the title" part of it.

`grep` is completely out of the question here, because it operates on a line at a time. Sed or awk can handle simple cases, but for adequate processing of structured data you really do need to use a tool which can handle the structure. See e.g. http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not — tripleee, Sep 03 '12 at 05:48
I don't know, I've seen grep work on multiple lines somehow. Thanks for the link, I'll check it out. — Tux, Sep 03 '12 at 05:49

score 1 · Answer 1 · answered Sep 03 '12 at 06:14

I got it with the following command:

curl --silent http://domain.com/index.html | grep -E "<h1.*><a.*>.*</a></h1>" | sed 's/.*<a.*>\(.*\)<\/a>.*/\1/'

Thank you all.
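
The command above relies on the whole h1 element sitting on one line, as in the grep result shown in the question. If the markup is split across lines like the code example, a rough workaround is to join the lines first and then extract with sed. This is only a sketch under assumptions: the function name extract_h1_title is made up for illustration, the page is assumed to contain a single h1 whose only child is one link, and the link text is assumed to contain no nested tags. For anything less regular, a real HTML parser is the safer choice.

```shell
# Hypothetical helper: reads HTML on stdin, prints the link text inside the h1.
extract_h1_title() {
  # Join all lines into one so the pattern can span the original line breaks.
  tr '\n' ' ' |
    # Match <h1 ...>, optional whitespace, <a ...>, then capture everything
    # up to the next '<' as the title; -n plus the p flag prints only on a match.
    sed -n 's/.*<h1[^>]*>[[:space:]]*<a[^>]*>\([^<]*\)<\/a>[[:space:]]*<\/h1>.*/\1/p'
}

# Usage with the multi-line markup from the question.
printf '%s\n' \
  '<h1 class="round title">' \
  '  <a href="/somepage">This is the title</a>' \
  '</h1>' | extract_h1_title
```

With a live page the same function can be fed from curl, e.g. curl --silent http://domain.com/index.html | extract_h1_title. Using [^>]* and [^<]* instead of .* keeps each part of the match from running past the tag it is meant to cover.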

Tux
