88

How can I retrieve the page title of a webpage (title html tag) using Python?

cschol
  • Since this question was asked, many web pages have started using an og:title meta tag, which contains the original title, whereas the <title> tag is often prefixed and suffixed with other data. Initially used only by Facebook as part of OpenGraph, OpenGraph metadata is now provided by many sites. og:title has become the standard source for a page's title, especially for news articles. – Nicolas Sep 16 '18 at 16:40

12 Answers

102

Here's a simplified version of @Vinko Vrsalovic's answer:

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string

NOTE:

  • soup.title finds the first title element anywhere in the html document

  • title.string assumes it has only one child node, and that child node is a string

For BeautifulSoup 4.x, use a different import:

from bs4 import BeautifulSoup
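
For reference, here is a minimal Python 3 / bs4 version combining this answer with the comments below (urllib.request instead of urllib2, an explicit parser, and a guard for a missing title); treat it as a sketch:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# urllib2 was split into urllib.request and urllib.error in Python 3
soup = BeautifulSoup(urlopen("https://www.google.com"), "html.parser")

# soup.title is None when the document has no <title> tag
print(soup.title.string if soup.title else None)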
jfs
  • Thank you! In case anyone runs into similar problems, in my Python 3 environment I had to use `urllib.request` instead of `urllib2`. Not sure why. To avoid the BeautifulSoup warning about my parser, I had to do `soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")`. – sudo Jan 12 '16 at 18:10
  • For Python 3, use `import urllib.request as urllib` instead of `import urllib2` – Ahmad Ismail Sep 18 '20 at 23:08
  • Be aware that in case of a missing title tag or an empty title like `<title></title>`, executing `soup.title.string` will return `None` – Eitanmg Oct 06 '20 at 09:58
  • @Eitanmg: Indeed, https://repl.it/@zed1/beautifulsoup-empty-title-is-none – jfs Oct 06 '20 at 17:01
68

I always use lxml for such tasks. You could use BeautifulSoup as well.

import lxml.html
t = lxml.html.parse(url)
print(t.find(".//title").text)

EDIT based on comment:

from urllib2 import urlopen
from lxml.html import parse

url = "https://www.google.com"
page = urlopen(url)
p = parse(page)
print(p.find(".//title").text)
cjpais
Peter Hoffmann
  • Just in case you get an IOError with the code above: http://stackoverflow.com/questions/3116269/error-with-parse-function-in-lxml – Yosh Dec 30 '13 at 10:24
  • [lxml may have issues with Unicode](http://stackoverflow.com/q/15302125/4279), you could [use bs4.UnicodeDammit to help it find the correct character encoding](http://stackoverflow.com/a/15305248/4279) – jfs Sep 02 '14 at 13:39
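
A sketch of the approach from the comment above: bs4's UnicodeDammit guesses the character encoding before the markup is handed to lxml (Python 3; google.com is used only as an example):

from urllib.request import urlopen
from bs4 import UnicodeDammit
import lxml.html

raw = urlopen("https://www.google.com").read()
dammit = UnicodeDammit(raw)                        # detect the character encoding
doc = lxml.html.fromstring(dammit.unicode_markup)  # parse the decoded markup with lxml
print(doc.find(".//title").text)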
25

No need to import other libraries. Requests has this functionality built in.

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
Rahul Chawla
  • Often, "importing other libraries" seems to cause more work. Thank you for helping us avoid that! – 9-Pin Mar 01 '21 at 22:15
15

The mechanize Browser object has a title() method. So the code from this post can be rewritten as:

from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print br.title()
codeape
14

This is probably overkill for such a simple task, but if you plan to do more than that, then it's saner to start from these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to get the content and regexes or some other parser to parse the HTML).

Links: BeautifulSoup mechanize

#!/usr/bin/env python
# coding: utf-8

from bs4 import BeautifulSoup
from mechanize import Browser

# This retrieves the webpage content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data()

# This parses the content
soup = BeautifulSoup(data, "html.parser")
title = soup.find('title')

# This outputs the content :)
print(title.renderContents())
S Habeeb Ullah
Vinko Vrsalovic
12

Using HTMLParser:

from urllib.request import urlopen
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

url = "http://example.com/"
html_string = str(urlopen(url).read())

parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain
Ricardo Branco
Finn
  • It would be worthwhile to note that this script is for Python 3. The HTMLParser module was renamed to html.parser in Python 3.x. Similarly, urllib.request was added in Python 3. – satishgoda Dec 13 '16 at 07:56
  • It's probably better to explicitly convert the bytes to a string: `r = urlopen(url)`, `encoding = r.info().get_content_charset()`, and `html_string = r.read().decode(encoding)`. – reubano Jan 10 '17 at 13:27
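
A short sketch of the explicit decoding suggested in the comment above, reusing the TitleParser class defined in this answer (falling back to UTF-8 when the server reports no charset is an added assumption):

from urllib.request import urlopen

url = "http://example.com/"
r = urlopen(url)
encoding = r.info().get_content_charset() or "utf-8"  # fall back if no charset is reported
html_string = r.read().decode(encoding)

parser = TitleParser()  # TitleParser as defined above
parser.feed(html_string)
print(parser.title)     # prints: Example Domain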
9

Use soup.select_one to target the title tag:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('url')
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)
QHarr
  • 83,427
  • 12
  • 54
  • 101
8

Using regular expressions:

import re
match = re.search('<title>(.*?)</title>', raw_html)
title = match.group(1) if match else 'No title'
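
A slightly more forgiving variant, sketched here to address the case-sensitivity and whitespace concerns raised in the comments below:

import re

raw_html = "<html><head><TITLE>\n  Example Domain\n</TITLE></head></html>"

# Case-insensitive, and tolerant of attributes on the tag and of newlines inside the title
match = re.search(r'<title[^>]*>(.*?)</title>', raw_html, re.IGNORECASE | re.DOTALL)
title = match.group(1).strip() if match else 'No title'
print(title)  # Example Domain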
Finn
  • What does `.group(1)` actually do? Any reference? – panjianom Jul 23 '17 at 20:25
  • Hi, `group(0)` would return the entire match. See [match-objects](https://docs.python.org/3.6/library/re.html#match-objects) for reference. – Finn Jul 23 '17 at 21:45
  • This will miss any cases where the title tags are not formed exactly as `<title>` (uppercase, mixed case, extra spacing) – Luke Rehmann Feb 08 '18 at 19:42
  • I would also include in case there's other data within the title tag. – Pranav Wadhwa Jul 13 '19 at 15:40
  • I used `re.compile(r'<title(\s[^>]*)?>([^<]*)</title>', re.IGNORECASE)` to address the concerns of @LukeRehmann and @PranavWadhwa. There are still plenty of cases this could go awry, and if you are parsing arbitrary HTML documents, this shouldn't be used, but in my case the HTML content is under my control, so no problems there. – coderforlife Jan 03 '23 at 18:23
2

soup.title.string actually returns a unicode string. To convert it into a normal string, you need to do string = string.encode('ascii', 'ignore')
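
A short Python 2-style illustration of what this answer describes, with a made-up title string (see the comment below: 'ascii'/'ignore' silently drops non-ASCII characters, while 'utf-8' keeps everything as bytes):

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

soup = BeautifulSoup(u"<title>Caf\xe9 \u2013 Menu</title>", "html.parser")
title = soup.title.string               # a unicode string with non-ASCII characters in it
print(title.encode('ascii', 'ignore'))  # non-ASCII characters are dropped
print(title.encode('utf-8'))            # UTF-8 bytes, nothing is lost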

Joe
Sai Kiriti Badam
  • That will just remove any non-ASCII characters, which probably isn't what you want. If you really want bytes (which is what `encode` gives) and not a string, encode with the correct charset, e.g. `string.encode('utf-8')`. – reubano Jan 10 '17 at 13:25
2

Here is a fault-tolerant HTMLParser implementation.
You can throw pretty much anything at get_title() without it breaking; if anything unexpected happens, get_title() will return None.
When Parser() downloads the page, it encodes it to ASCII regardless of the charset used in the page, ignoring any errors. It would be trivial to change to_ascii() to convert the data into UTF-8 or any other encoding: just add an encoding argument and rename the function to something like to_encoding() (see the sketch after the code below).
By default, HTMLParser() will break on broken HTML; it will even break on trivial things like mismatched tags. To prevent this behavior, I replaced HTMLParser()'s error method with a function that ignores the errors.

#-*-coding:utf8;-*-
#qpy:3
#qpy:console

''' 
Extract the title from a web page using
the standard lib.
'''

from html.parser import HTMLParser
from urllib.request import urlopen
import urllib

def error_callback(*_, **__):
    pass

def is_string(data):
    return isinstance(data, str)

def is_bytes(data):
    return isinstance(data, bytes)

def to_ascii(data):
    if is_string(data):
        data = data.encode('ascii', errors='ignore')
    elif is_bytes(data):
        data = data.decode('ascii', errors='ignore')
    else:
        data = str(data).encode('ascii', errors='ignore')
    return data


class Parser(HTMLParser):
    def __init__(self, url):
        self.title = None
        self.rec = False
        HTMLParser.__init__(self)
        # Replace the error method before feeding, so broken HTML does not raise.
        self.error = error_callback
        try:
            self.feed(to_ascii(urlopen(url).read()))
        except urllib.error.HTTPError:
            return
        except urllib.error.URLError:
            return
        except ValueError:
            return

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.rec = True

    def handle_data(self, data):
        if self.rec:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.rec = False


def get_title(url):
    return Parser(url).title

print(get_title('http://www.google.com'))
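
And the to_encoding() variant mentioned in the text above, as a sketch (the default encoding is my own assumption):

def to_encoding(data, encoding='utf-8'):
    # Same behaviour as to_ascii(), but with a configurable encoding.
    if is_string(data):
        return data.encode(encoding, errors='ignore')
    if is_bytes(data):
        return data.decode(encoding, errors='ignore')
    return str(data).encode(encoding, errors='ignore')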
Ricky Wilson
2

In Python 3, we can use urlopen from urllib.request and BeautifulSoup from the bs4 library to fetch the page title.

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.google.com")
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)

Here we are using 'lxml', which is generally the fastest parser that BeautifulSoup supports.

S Habeeb Ullah
0

Using lxml...

Getting it from the page's meta tags, following the Facebook OpenGraph protocol:

import lxml.html

html_doc = lxml.html.parse(some_url)

t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]

or getting the title element itself with .xpath:

t = html_doc.xpath(".//title")[0].text
markling