Extract Title of a html file using grep

Question

cat 1.html | grep "<title>" > title.txt

This grep statement is not working.

What is the best way to grab the title of a page using grep or sed?

Thanks.

[Friends don't let friends parse HTML with regular expressions.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) (stolen from [Ether](http://stackoverflow.com/users/40468/ether)). — Oded, Sep 30 '10 at 17:27

score 5 · Accepted Answer · answered Sep 30 '10 at 23:25


You can use awk. This works even when the title spans multiple lines:

$ cat file

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>

    <title>Extract Title of a html file

using grep - Stack Overflow</title>
    <link rel="stylesheet" type="text/css" href="http://sstatic.net/stackoverflow/all.css?v=9ea1a272f146">

$ awk -vRS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}' file
Extract Title of a html file using grep - Stack Overflow
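A multi-character RS is honored by gawk and mawk, but POSIX leaves its behavior unspecified, so the one-liner above may act differently on other awk implementations. Here is a minimal sketch of a variant that avoids it, assuming the tags are lowercase and the title contains no nested tags. The sample file contents and the variable names are illustrative only.

```shell
# Illustrative sample with a title split across lines (assumed input).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html>
<head>
    <title>Extract Title of a html file

using grep - Stack Overflow</title>
</head>
EOF

title=$(awk '
  { buf = buf (NR > 1 ? " " : "") $0 }            # join all lines with spaces
  END {
    if (match(buf, /<title>[^<]*<\/title>/)) {
      # Skip the 7 characters of <title> and drop the 8 of </title>.
      t = substr(buf, RSTART + 7, RLENGTH - 15)
      gsub(/[ \t]+/, " ", t)                       # squeeze whitespace runs
      sub(/^ /, "", t); sub(/ $/, "", t)           # trim the ends
      print t
    }
  }' "$tmp")
printf '%s\n' "$title"   # Extract Title of a html file using grep - Stack Overflow
rm -f "$tmp"
```

Joining the lines with a space instead of deleting the newlines keeps words from running together when the title breaks mid-sentence.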

answered Sep 30 '10 at 23:25

ghostdog74


Hey GhostDog74, I created a follow up question to this so the awk command would operate on several lines not just one, if you could give your input at all, that would be great - http://stackoverflow.com/questions/39650612/how-to-extract-text-between-html-tags-with-or-condition-multiple-times – DomainsFeatured Sep 23 '16 at 00:27

score 4 · Answer 2 · answered Sep 30 '10 at 17:46


sed -n 's/<title>\(.*\)<\/title>/\1/Ip' 1.html

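This uses the combination of -n (suppress automatic printing) and the p flag (print the line when the substitution succeeds) to print only matches. The I flag makes the match case-insensitive and is a GNU sed extension.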

answered Sep 30 '10 at 17:46

Dave


This assumes that nothing else appears on the line containing the `<title>` tag. – Untitled Jul 22 '16 at 07:37
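One way to relax that assumption is to let the pattern also consume whatever surrounds the tag pair on the line. This is a sketch, not a general fix: it still requires the opening and closing tags to be on the same line. The sample file is made up for illustration, and the I flag from the answer is left out because it is a GNU sed extension (add it back for case-insensitive matching with GNU sed).

```shell
# Made-up sample where other markup shares the line with the title.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html><head><meta charset="utf-8"><title>My Page</title><link rel="stylesheet" href="a.css"></head>
EOF

# The leading and trailing .* swallow the text around <title>...</title>,
# so only the captured group is left on the printed line.
title=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$tmp")
printf '%s\n' "$title"   # My Page
rm -f "$tmp"
```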

score 3 · Answer 3 · answered Sep 30 '10 at 17:19


You can use xml_grep from the XML::Twig Perl package:

xml_grep --text_only title 1.html

answered Sep 30 '10 at 17:19

Alex Howansky


score 2 · Answer 4 · answered Sep 14 '17 at 08:24


cat 1.html | grep -oE "<title>.*</title>" | sed 's/<title>//' | sed 's/<\/title>//'

grep with -oE prints only the part of the line that matches the title element, then the two sed calls remove the opening and closing tags.
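To make the two stages visible, here is a small sketch of the same idea that reads the file directly instead of going through cat and uses a single sed call with two expressions. The sample file and variable names are assumptions for illustration. -E is dropped because the pattern uses no extended-regex syntax.

```shell
# Made-up sample input.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<head><title>My Page</title><meta name="x"></head>
EOF

# Stage 1: -o prints only the matched part of the line, tags included.
matched=$(grep -o '<title>.*</title>' "$tmp")
printf '%s\n' "$matched"    # <title>My Page</title>

# Stage 2: one sed call with two expressions strips both tags.
title=$(printf '%s\n' "$matched" | sed -e 's/<title>//' -e 's/<\/title>//')
printf '%s\n' "$title"      # My Page
rm -f "$tmp"
```

As with the sed answer, this only works when the opening and closing tags sit on the same line.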

answered Sep 14 '17 at 08:24

Olivier Lasne


score 0 · Answer 5 · answered Sep 30 '10 at 17:24


 grep "<title>" /path/to/html.html

Works fine for me. Are you sure 1.html is in your current working directory? Run pwd to check.
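For reference, a plain grep prints every whole line that contains the pattern, tags included, and it works line by line, so a title split across lines comes back incomplete. A small sketch with an assumed sample file:

```shell
# Assumed sample: the title text continues on a second line.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<head>
<title>Extract Title
of a html file</title>
</head>
EOF

# Only the line containing the literal string <title> is printed.
line=$(grep "<title>" "$tmp")
printf '%s\n' "$line"   # <title>Extract Title
rm -f "$tmp"
```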

answered Sep 30 '10 at 17:24

chigley


score 0 · Answer 6 · answered May 23 '17 at 14:41

Alex Howansky's answer is good enough, although if the HTML is not well formed, xml_grep may crash.

I recommend using tidy to convert the HTML to XML first, then running xml_grep on the result:

tidy -asxml -utf8 html_file.html > out.xml
xml_grep 'xpath_expression' out.xml
