What regex finds the text "This is the title" inside these tags, using grep, sed, or awk?

Code Example:

<h1 class="round title">
  <a href="/somepage">This is the title</a>
</h1>

I've tried this on the h1 tag above:

curl --silent http://domain.com/index.html | grep "<h1 class=\"round title\">"

Result is:

<h1 class="round title"><a href="/somepage">This is the title</a></h1>

and I only need the "This is the title" part of it.

Tux
  • For the general case, you need to use an HTML parser. – pizza Sep 03 '12 at 03:16
  • If I needed to use an HTML parser, I would. But I need bash =) – Tux Sep 03 '12 at 04:46
  • `grep` is completely out of the question here, because it operates on a line at a time. Sed or awk can handle simple cases, but for adequate processing of structured data you really do need to use a tool which can handle the structure. See e.g. http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – tripleee Sep 03 '12 at 05:48
  • I don't know, I've seen grep work on multiple lines somehow. Thanks for the link, I'll check it out. – Tux Sep 03 '12 at 05:49
  • Yeah, the link provided is not helping, but thanks. – Tux Sep 03 '12 at 05:50

1 Answer

I got it with the following command:

curl --silent http://domain.com/index.html | grep -E "<h1.*><a.*>(.*)</a></h1>" | sed 's/.*<a.*>\(.*\)<\/a>.*/\1/'
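
The pipeline above relies on the whole h1 element arriving on a single line, as in the grep result shown in the question. If the markup is split across lines, as in the question's code example, a sed-only variant along these lines might be worth trying. It is only a sketch: the function name extract_title is made up for illustration, it assumes the page contains a single matching h1 with one link and no nested tags inside the link text, and it is no substitute for a real HTML parser.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original command): print the link text
# inside <h1 class="round title">, even when the element spans several lines.
extract_title() {
  # Join all input lines into one, then terminate it with a single newline
  # so that sed receives a well-formed line.
  { tr '\n' ' '; echo; } |
    # [^>]* and [^<]* keep each match inside one tag / one text node,
    # instead of letting a greedy .* run across tag boundaries.
    sed -n 's/.*<h1 class="round title">[^<]*<a[^>]*>\([^<]*\)<\/a>.*/\1/p'
}

# Sample input mirroring the multi-line markup from the question.
printf '%s\n' \
  '<h1 class="round title">' \
  '  <a href="/somepage">This is the title</a>' \
  '</h1>' | extract_title
# prints: This is the title
```

With the real page this would be used as `curl --silent http://domain.com/index.html | extract_title`. If several h1 elements match, the leading greedy `.*` means only the last one is printed.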

Thank you all.

Tux