How can I retrieve the page title of a webpage (title html tag) using Python?
-
Since this question was asked, many web pages have started including an og:title meta tag, which contains the original title, while `<title>` is often prefixed and suffixed with other data. Initially used by just Facebook as part of OpenGraph, OpenGraph metadata is now provided by many sites, and og:title has become the standard source for a page's title, especially for news articles. – Nicolas Sep 16 '18 at 16:40
12 Answers
Here's a simplified version of @Vinko Vrsalovic's answer:
import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string
NOTE:
soup.title finds the first title element anywhere in the html document
title.string assumes it has only one child node, and that child node is a string
For BeautifulSoup 4.x, use a different import:
from bs4 import BeautifulSoup
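For Python 3 with BeautifulSoup 4, the same idea can be sketched like this (parsing an in-memory page here to keep the example self-contained; swap in `urllib.request.urlopen(url)` to fetch a live one):

```python
from bs4 import BeautifulSoup

# Stand-in for the bytes returned by urllib.request.urlopen(url).read()
html = "<html><head><title>Example Domain</title></head><body></body></html>"

# "html.parser" is the stdlib parser; pass "lxml" instead if it is installed
soup = BeautifulSoup(html, "html.parser")

# Guard against pages with no <title> at all, where soup.title is None
title = soup.title.string if soup.title else None
print(title)  # Example Domain
```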
-
Thank you! In case anyone runs into similar problems, in my Python 3 environment, I had to use `urllib.request` instead of `urllib2`. Not sure why. To avoid the BeautifulSoup warning about my parser, I had to do `soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")`. – sudo Jan 12 '16 at 18:10
-
For Python 3 use `import urllib.request as urllib` instead of `import urllib2` – Ahmad Ismail Sep 18 '20 at 23:08
-
Be aware that in case of a missing title tag OR an empty title such as `<title></title>`, executing `soup.title.string` will return `None` – Eitanmg Oct 06 '20 at 09:58
I'll always use lxml for such tasks. You could use BeautifulSoup as well.
import lxml.html
t = lxml.html.parse(url)
print(t.find(".//title").text)
EDIT based on comment:
from urllib2 import urlopen
from lxml.html import parse
url = "https://www.google.com"
page = urlopen(url)
p = parse(page)
print(p.find(".//title").text)

-
Just in case you get IOError with the code above: http://stackoverflow.com/questions/3116269/error-with-parse-function-in-lxml – Yosh Dec 30 '13 at 10:24
-
[lxml may have issues with Unicode](http://stackoverflow.com/q/15302125/4279), you could [use bs4.UnicodeDammit to help it find the correct character encoding](http://stackoverflow.com/a/15305248/4279) – jfs Sep 02 '14 at 13:39
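If the IOError or Unicode issues mentioned in the comments bite, one workaround (a sketch, not the answer's exact code) is to obtain the bytes yourself and hand lxml a document via `fromstring`:

```python
import lxml.html

# In practice these bytes would come from urlopen(url).read();
# a literal stands in here so the example needs no network access.
raw = b"<html><head><title>Hello</title></head><body></body></html>"

doc = lxml.html.fromstring(raw)
title_el = doc.find(".//title")

# find() returns None when the page has no <title>
title = title_el.text if title_el is not None else None
print(title)  # Hello
```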
No need to import other libraries. Requests has this functionality built-in.
>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'

-
Often, "importing other libraries" seems to cause more work. Thank you for helping us avoid that! – 9-Pin Mar 01 '21 at 22:15
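The slicing trick itself needs no network at all; here is a sketch of the same `find`-based extraction on an in-memory string (the literal HTML stands in for `n.text`; note that this approach misbehaves if the tag carries attributes or different casing):

```python
# Stand-in for the response text fetched with requests
html = "<html><head><title>Friends (TV Series 1994-2004) - IMDb</title></head></html>"

# Slice between the opening and closing tags; len("<title>") == 7
start = html.find("<title>") + len("<title>")
end = html.find("</title>")
title = html[start:end]
print(title)  # Friends (TV Series 1994-2004) - IMDb
```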
This is probably overkill for such a simple task, but if you plan to do more than that, then it's saner to start from these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to get content, and regexen or some other parser to parse the HTML).
Links: BeautifulSoup mechanize
#!/usr/bin/env python
#coding:utf-8
from bs4 import BeautifulSoup
from mechanize import Browser
#This retrieves the webpage content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data()
#This parses the content
soup = BeautifulSoup(data, "html.parser")
title = soup.find('title')
#This outputs the content :)
print(title.renderContents())

Using HTMLParser:
from urllib.request import urlopen
from html.parser import HTMLParser
class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

url = "http://example.com/"
html_string = str(urlopen(url).read())
parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain

-
It would be worthwhile to note that this script is for Python 3. The `HTMLParser` module was renamed to `html.parser` in Python 3.x. Similarly, `urllib.request` was added in Python 3. – satishgoda Dec 13 '16 at 07:56
-
It's probably better to explicitly convert the bytes to a string: `r = urlopen(url)`, `encoding = r.info().get_content_charset()`, and `html_string = r.read().decode(encoding)`. – reubano Jan 10 '17 at 13:27
Use `soup.select_one` to target the title tag
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('url')
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)

Using regular expressions
import re
match = re.search('<title>(.*?)</title>', raw_html)
title = match.group(1) if match else 'No title'
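The comments below point out casing, attribute, and spacing pitfalls; a slightly hardened variant (still regex-on-HTML, so best reserved for pages whose markup you control) might look like:

```python
import re

# Stand-in for HTML fetched elsewhere; note the uppercase tags,
# the attribute on <TITLE>, and the newline inside the title text
raw_html = '<HTML><HEAD><TITLE class="x">My\nPage</TITLE></HEAD></HTML>'

# [^>]* tolerates attributes, IGNORECASE tolerates any casing,
# DOTALL lets .*? span newlines inside the title
pattern = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
match = pattern.search(raw_html)
title = match.group(1) if match else "No title"
print(title)
```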

-
Hi, `group(0)` would return the entire match. See [match-objects](https://docs.python.org/3.6/library/re.html#match-objects) for reference. – Finn Jul 23 '17 at 21:45
-
This will miss any cases where the title tags are not formed exactly as `<title>` (uppercase, mixed case, spacing) – Luke Rehmann Feb 08 '18 at 19:42
-
I would also include `<title.*?>` in case there's other data within the title tag. – Pranav Wadhwa Jul 13 '19 at 15:40
-
I used `re.compile(r'<title( [^>]*)?>([^<]*)</title>', re.IGNORECASE)` to address the concerns of @LukeRehmann and @PranavWadhwa. There are still plenty of cases this could go awry, and if you are parsing arbitrary HTML documents this shouldn't be used, but in my case the HTML content is under my control, so no problems there. – coderforlife Jan 03 '23 at 18:23
`soup.title.string` actually returns a unicode string. To convert that into a normal string, you need to do:
string = string.encode('ascii', 'ignore')
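As the follow-up comment notes, `'ignore'` silently drops non-ASCII characters; a short sketch of the difference against a UTF-8 encode (the title string is a stand-in taken from the IMDb example above):

```python
# Title containing an en dash (U+2013), a non-ASCII character
title = u"Friends (TV Series 1994\u20132004) - IMDb"

ascii_bytes = title.encode("ascii", "ignore")  # en dash is silently dropped
utf8_bytes = title.encode("utf-8")             # en dash preserved as bytes

print(ascii_bytes)                          # the dash is gone
print(utf8_bytes.decode("utf-8") == title)  # True: UTF-8 round-trips losslessly
```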

-
That will just remove any non-ASCII characters, which probably isn't what you want. If you really want bytes (what `encode` gives) and not a string, encode with the correct charset, e.g. `string.encode('utf-8')`. – reubano Jan 10 '17 at 13:25
Here is a fault tolerant `HTMLParser` implementation.
You can throw pretty much anything at `get_title()` without it breaking; if anything unexpected happens, `get_title()` will return `None`.
When `Parser()` downloads the page, it encodes it to ASCII regardless of the charset used in the page, ignoring any errors. It would be trivial to change `to_ascii()` to convert the data into UTF-8 or any other encoding: just add an encoding argument and rename the function to something like `to_encoding()`.
By default `HTMLParser()` will break on broken html; it will even break on trivial things like mismatched tags. To prevent this behavior, I replaced `HTMLParser()`'s error method with a function that ignores the errors.
#-*-coding:utf8;-*-
#qpy:3
#qpy:console
'''
Extract the title from a web page using
the standard lib.
'''
from html.parser import HTMLParser
from urllib.request import urlopen
import urllib

def error_callback(*_, **__):
    pass

def is_string(data):
    return isinstance(data, str)

def is_bytes(data):
    return isinstance(data, bytes)

def to_ascii(data):
    if is_string(data):
        data = data.encode('ascii', errors='ignore')
    elif is_bytes(data):
        data = data.decode('ascii', errors='ignore')
    else:
        data = str(data).encode('ascii', errors='ignore')
    return data

class Parser(HTMLParser):
    def __init__(self, url):
        self.title = None
        self.rec = False
        HTMLParser.__init__(self)
        # Install the error handler before feeding, so broken
        # html encountered during the feed is ignored
        self.error = error_callback
        try:
            self.feed(to_ascii(urlopen(url).read()))
        except urllib.error.HTTPError:
            return
        except urllib.error.URLError:
            return
        except ValueError:
            return
        self.rec = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.rec = True

    def handle_data(self, data):
        if self.rec:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.rec = False

def get_title(url):
    return Parser(url).title

print(get_title('http://www.google.com'))

In Python 3, we can call the method `urlopen` from `urllib.request` and `BeautifulSoup` from the `bs4` library to fetch the page title.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.google.com")
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)
Here we are using 'lxml', generally the fastest parser BeautifulSoup supports (it must be installed separately).

Using lxml...
Getting it from page meta tagged according to the Facebook opengraph protocol:
import lxml.html
html_doc = lxml.html.parse(some_url)
t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]
or, with the same parsed document, the plain title tag:
t = html_doc.xpath(".//title")[0].text
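The two lookups can be combined, falling back to `<title>` when no OpenGraph tag is present; a sketch parsing an in-memory string (`lxml.html.parse(some_url)` yields an ElementTree whose `.xpath` behaves the same way):

```python
import lxml.html

# Stand-in for a fetched page that carries both an og:title and a <title>
raw = (b'<html><head>'
       b'<meta property="og:title" content="OG Title"/>'
       b'<title>Plain Title - Site Name</title>'
       b'</head><body></body></html>')

doc = lxml.html.fromstring(raw)

# Prefer the OpenGraph title; the xpath returns a (possibly empty) list
og = doc.xpath('//meta[@property="og:title"]/@content')
title = og[0] if og else doc.xpath(".//title")[0].text
print(title)  # OG Title
```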

-
`lxml.html.parse` doesn't fetch HTML from a URL! You have to give it some actual HTML. – Zev Averbach Apr 06 '22 at 09:32