2
cat 1.html | grep "<title>" > title.txt  

This grep command is not working.

What is the best way to grab the title of a page using grep or sed?

Thanks.

DomainsFeatured
Vamsi Krishna B
  • 6
    [Friends don't let friends parse HTML with regular expressions.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) (stolen from [Ether](http://stackoverflow.com/users/40468/ether)). – Oded Sep 30 '10 at 17:27
  • 2
    His requirement is very simple. A regular expression is OK. – ghostdog74 Oct 01 '10 at 02:47

6 Answers

5

You can use awk. This works even when the title spans multiple lines:

$ cat file

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>

    <title>Extract Title of a html file

using grep - Stack Overflow</title>
    <link rel="stylesheet" type="text/css" href="http://sstatic.net/stackoverflow/all.css?v=9ea1a272f146">

$ awk -v RS="</title>" '/<title>/{sub(/.*<title>/,""); gsub(/[ \t]*\n+[ \t]*/," "); print; exit}' file
Extract Title of a html file using grep - Stack Overflow
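A multi-character RS works in gawk, but POSIX leaves its behaviour unspecified, so other awk implementations may not split on "</title>". Below is a minimal sketch of the same idea without RS, assuming a single lowercase <title> element; the function name title_awk is hypothetical and only for illustration.

```shell
# Hypothetical helper: print the <title> text of the HTML file given as $1.
# Assumes lowercase tags and one title element; not a general HTML parser.
title_awk() {
    awk '
        # Start collecting at the line that contains the opening tag.
        /<title>/ { collecting = 1 }

        # Join every collected line with a single space.
        collecting { buf = (buf == "" ? $0 : buf " " $0) }

        # At the closing tag, strip the markup, tidy whitespace, print once.
        collecting && /<\/title>/ {
            sub(/.*<title>/, "", buf)
            sub(/<\/title>.*/, "", buf)
            gsub(/[ \t]+/, " ", buf)
            sub(/^ /, "", buf)
            sub(/ $/, "", buf)
            print buf
            exit
        }
    ' "$1"
}
```

Called as title_awk file, it should print the title on one line even when the tags are several lines apart.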
ghostdog74
  • Hey GhostDog74, I created a follow up question to this so the awk command would operate on several lines not just one, if you could give your input at all, that would be great - http://stackoverflow.com/questions/39650612/how-to-extract-text-between-html-tags-with-or-condition-multiple-times – DomainsFeatured Sep 23 '16 at 00:27
4
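sed -n 's/.*<title>\(.*\)<\/title>.*/\1/Ip' 1.html

The -n option suppresses sed's default output, and the p flag prints only the lines where the substitution matched. The I flag (a GNU sed extension) makes the match case-insensitive. This only works when the opening and closing tags are on the same line.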
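A one-line substitution like the sed command above only matches when both tags sit on the same line. One possible workaround, sketched here under the assumption that the page has a single title element, is to flatten the file with tr before handing it to sed. The function name extract_title is made up for this example, and the bracket expressions stand in for the GNU-only I flag.

```shell
# Hypothetical helper: print the <title> text of the HTML file given as $1.
# Assumes one title element per page; not a general HTML parser.
extract_title() {
    # 1. Turn newlines and tabs into spaces and squeeze runs of spaces,
    #    so the whole document becomes a single line.
    # 2. Keep only the text between the title tags (any letter case).
    # 3. Trim the leading and trailing space left over from step 1.
    tr -s '\n\t' '  ' < "$1" |
        sed -n 's/.*<[Tt][Ii][Tt][Ll][Ee][^>]*>\(.*\)<\/[Tt][Ii][Tt][Ll][Ee]>.*/\1/p' |
        sed 's/^ *//; s/ *$//'
}
```

Usage would be something like extract_title 1.html > title.txt. Note that squeezing also collapses repeated spaces inside the title itself.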

Dave
3

You can use xml_grep from the XML::Twig Perl package:

xml_grep --text_only title 1.html
Alex Howansky
2
grep -oE "<title>.*</title>" 1.html | sed -e 's/<title>//' -e 's/<\/title>//'

grep -oE prints only the part of the line that matches the title element; sed then strips the opening and closing tags.
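Because .* is greedy, a pattern like the one above can swallow too much when a single line holds more than one title element (minified HTML, or an inline SVG with its own title). A hedged variant, assuming a grep that supports -o (GNU and BSD grep do) and a title with no nested markup, is to match [^<]* instead and keep only the first hit. The name title_grep is hypothetical.

```shell
# Hypothetical helper: print the first <title> text of the file given as $1.
# Assumes the title sits on one line and contains no "<" character.
title_grep() {
    # -o prints each match on its own line, -i ignores letter case.
    # sed strips the tags from the first match and then quits.
    grep -oiE '<title>[^<]*</title>' "$1" | sed 's/<[^>]*>//g;q'
}
```

For a title split across lines this still finds nothing, so it complements the multi-line approaches rather than replacing them.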

Olivier Lasne
0
 grep "<title>" /path/to/html.html

Works fine for me. Are you sure 1.html is in your current working directory? Run pwd to check.
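If the file location is in doubt, a small wrapper can make the failure obvious instead of leaving an empty title.txt. This is only a sketch; grab_title_line is a hypothetical name, and it prints the whole matching line just as the plain grep does.

```shell
# Hypothetical helper: grep the <title> line, but complain if the file is missing.
grab_title_line() {
    file=$1
    if [ ! -r "$file" ]; then
        # Report the working directory so a wrong relative path is easy to spot.
        echo "cannot read $file (current directory: $(pwd))" >&2
        return 1
    fi
    grep -i '<title>' "$file"
}
```

For example, grab_title_line 1.html > title.txt would print an error to the terminal when 1.html is not where the shell is looking.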

chigley
0

Alex Howansky's answer is good enough, although if the HTML is not well formed, xml_grep may fail to parse it.

I recommend using tidy to convert the HTML to XML first, then running xml_grep on the result:

tidy -asxml -utf8 html_file.html > out.xml
xml_grep 'xpath_expression' out.xml