88

How can I retrieve the page title of a webpage (title html tag) using Python?

cschol
  • Since this question was asked, many web pages have started using an og:title meta tag, which contains the original title, whereas the <title> tag is often prefixed and suffixed with other data. Initially used only by Facebook as part of OpenGraph, OpenGraph metadata is now provided by many sites. og:title has become the standard source for a page's title, especially for news articles. – Nicolas Sep 16 '18 at 16:40

12 Answers

102

Here's a simplified version of @Vinko Vrsalovic's answer:

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string

NOTE:

  • soup.title finds the first title element anywhere in the html document

  • title.string assumes it has only one child node, and that child node is a string

For BeautifulSoup 4.x, use a different import:

from bs4 import BeautifulSoup
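
For reference, here is a minimal Python 3 / bs4 version combining this answer with the comments below (urllib.request instead of urllib2, an explicit parser, and a guard for a missing title); treat it as a sketch:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# urllib2 was split into urllib.request and urllib.error in Python 3
soup = BeautifulSoup(urlopen("https://www.google.com"), "html.parser")

# soup.title is None when the document has no <title> tag
print(soup.title.string if soup.title else None)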
jfs
  • Thank you! In case anyone runs into similar problems, in my Python 3 environment I had to use `urllib.request` instead of `urllib2`. Not sure why. To avoid the BeautifulSoup warning about my parser, I had to do `soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")`. – sudo Jan 12 '16 at 18:10
  • For Python 3, use `import urllib.request as urllib` instead of `import urllib2` – Ahmad Ismail Sep 18 '20 at 23:08
  • Be aware that in case of a missing title tag or an empty title like `<title></title>`, executing `soup.title.string` will return `None` – Eitanmg Oct 06 '20 at 09:58
  • @Eitanmg: Indeed, https://repl.it/@zed1/beautifulsoup-empty-title-is-none – jfs Oct 06 '20 at 17:01
68

I always use lxml for such tasks. You could use BeautifulSoup as well.

import lxml.html
t = lxml.html.parse(url)
print(t.find(".//title").text)

EDIT based on comment:

from urllib2 import urlopen
from lxml.html import parse

url = "https://www.google.com"
page = urlopen(url)
p = parse(page)
print(p.find(".//title").text)
cjpais
Peter Hoffmann
  • Just in case you get an IOError with the code above: http://stackoverflow.com/questions/3116269/error-with-parse-function-in-lxml – Yosh Dec 30 '13 at 10:24
  • [lxml may have issues with Unicode](http://stackoverflow.com/q/15302125/4279), you could [use bs4.UnicodeDammit to help it find the correct character encoding](http://stackoverflow.com/a/15305248/4279) – jfs Sep 02 '14 at 13:39
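
A sketch of the approach from the comment above: bs4's UnicodeDammit guesses the character encoding before the markup is handed to lxml (Python 3; google.com is used only as an example):

from urllib.request import urlopen
from bs4 import UnicodeDammit
import lxml.html

raw = urlopen("https://www.google.com").read()
dammit = UnicodeDammit(raw)                        # detect the character encoding
doc = lxml.html.fromstring(dammit.unicode_markup)  # parse the decoded markup with lxml
print(doc.find(".//title").text)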
25

No need to import other libraries. Requests has this functionality built in.

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
Rahul Chawla
  • Often, "importing other libraries" seems to cause more work. Thank you for helping us avoid that! – 9-Pin Mar 01 '21 at 22:15
15

The mechanize Browser object has a title() method. So the code from this post can be rewritten as:

from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print br.title()
codeape
14

This is probably overkill for such a simple task, but if you plan to do more than that, then it's saner to start from these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to get the content and regexes or some other parser to parse the HTML).

Links: BeautifulSoup mechanize

#!/usr/bin/env python
# coding: utf-8

from bs4 import BeautifulSoup
from mechanize import Browser

# This retrieves the webpage content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data()

# This parses the content
soup = BeautifulSoup(data, "html.parser")
title = soup.find('title')

# This outputs the content :)
print(title.renderContents())
S Habeeb Ullah
Vinko Vrsalovic
12

Using HTMLParser:

from urllib.request import urlopen
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

url = "http://example.com/"
html_string = str(urlopen(url).read())

parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain
Ricardo Branco
Finn
  • It would be worthwhile to note that this script is for Python 3. The HTMLParser module was renamed to html.parser in Python 3.x. Similarly, urllib.request was added in Python 3. – satishgoda Dec 13 '16 at 07:56
  • It's probably better to explicitly convert the bytes to a string: `r = urlopen(url)`, `encoding = r.info().get_content_charset()`, and `html_string = r.read().decode(encoding)`. – reubano Jan 10 '17 at 13:27
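
A short sketch of the explicit decoding suggested in the comment above, reusing the TitleParser class defined in this answer (falling back to UTF-8 when the server reports no charset is an added assumption):

from urllib.request import urlopen

url = "http://example.com/"
r = urlopen(url)
encoding = r.info().get_content_charset() or "utf-8"  # fall back if no charset is reported
html_string = r.read().decode(encoding)

parser = TitleParser()  # TitleParser as defined above
parser.feed(html_string)
print(parser.title)     # prints: Example Domain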
9

Use soup.select_one to target the title tag:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('url')
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)
QHarr
  • 83,427
  • 12
  • 54
  • 101
8

Using regular expressions:

import re
match = re.search('<title>(.*?)</title>', raw_html)
title = match.group(1) if match else 'No title'
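
A slightly more forgiving variant, sketched here to address the case-sensitivity and whitespace concerns raised in the comments below:

import re

raw_html = "<html><head><TITLE>\n  Example Domain\n</TITLE></head></html>"

# Case-insensitive, and tolerant of attributes on the tag and of newlines inside the title
match = re.search(r'<title[^>]*>(.*?)</title>', raw_html, re.IGNORECASE | re.DOTALL)
title = match.group(1).strip() if match else 'No title'
print(title)  # Example Domain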
Finn
  • What does `.group(1)` actually do? Any reference? – panjianom Jul 23 '17 at 20:25
  • Hi, `group(0)` would return the entire match. See [match-objects](https://docs.python.org/3.6/library/re.html#match-objects) for reference. – Finn Jul 23 '17 at 21:45
  • This will miss any cases where the title tags are not formed exactly as `<title>` (uppercase, mixed case, extra spacing) – Luke Rehmann Feb 08 '18 at 19:42
  • I would also include in case there's other data within the title tag. – Pranav Wadhwa Jul 13 '19 at 15:40
  • I used `re.compile(r'<title(\s[^>]*)?>([^<]*)</title>', re.IGNORECASE)` to address the concerns of @LukeRehmann and @PranavWadhwa. There are still plenty of cases this could go awry, and if you are parsing arbitrary HTML documents, this shouldn't be used, but in my case the HTML content is under my control, so no problems there. – coderforlife Jan 03 '23 at 18:23
2

soup.title.string actually returns a unicode string. To convert it into a normal string, you need to do string = string.encode('ascii', 'ignore')
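
A short Python 2-style illustration of what this answer describes, with a made-up title string (see the comment below: 'ascii'/'ignore' silently drops non-ASCII characters, while 'utf-8' keeps everything as bytes):

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

soup = BeautifulSoup(u"<title>Caf\xe9 \u2013 Menu</title>", "html.parser")
title = soup.title.string               # a unicode string with non-ASCII characters in it
print(title.encode('ascii', 'ignore'))  # non-ASCII characters are dropped
print(title.encode('utf-8'))            # UTF-8 bytes, nothing is lost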

Joe
Sai Kiriti Badam
  • That will just remove any non-ASCII characters, which probably isn't what you want. If you really want bytes (which is what `encode` gives) and not a string, encode with the correct charset, e.g. `string.encode('utf-8')`. – reubano Jan 10 '17 at 13:25
2

Here is a fault-tolerant HTMLParser implementation.
You can throw pretty much anything at get_title() without it breaking; if anything unexpected happens, get_title() will return None.
When Parser() downloads the page, it encodes it to ASCII regardless of the charset used in the page, ignoring any errors. It would be trivial to change to_ascii() to convert the data into UTF-8 or any other encoding: just add an encoding argument and rename the function to something like to_encoding() (see the sketch after the code below).
By default, HTMLParser() will break on broken HTML; it will even break on trivial things like mismatched tags. To prevent this behavior, I replaced HTMLParser()'s error method with a function that ignores the errors.

#-*-coding:utf8;-*-
#qpy:3
#qpy:console

''' 
Extract the title from a web page using
the standard lib.
'''

from html.parser import HTMLParser
from urllib.request import urlopen
import urllib

def error_callback(*_, **__):
    pass

def is_string(data):
    return isinstance(data, str)

def is_bytes(data):
    return isinstance(data, bytes)

def to_ascii(data):
    if is_string(data):
        data = data.encode('ascii', errors='ignore')
    elif is_bytes(data):
        data = data.decode('ascii', errors='ignore')
    else:
        data = str(data).encode('ascii', errors='ignore')
    return data


class Parser(HTMLParser):
    def __init__(self, url):
        self.title = None
        self.rec = False
        HTMLParser.__init__(self)
        # Replace the error method before feeding, so broken HTML does not raise.
        self.error = error_callback
        try:
            self.feed(to_ascii(urlopen(url).read()))
        except urllib.error.HTTPError:
            return
        except urllib.error.URLError:
            return
        except ValueError:
            return

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.rec = True

    def handle_data(self, data):
        if self.rec:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.rec = False


def get_title(url):
    return Parser(url).title

print(get_title('http://www.google.com'))
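
And the to_encoding() variant mentioned in the text above, as a sketch (the default encoding is my own assumption):

def to_encoding(data, encoding='utf-8'):
    # Same behaviour as to_ascii(), but with a configurable encoding.
    if is_string(data):
        return data.encode(encoding, errors='ignore')
    if is_bytes(data):
        return data.decode(encoding, errors='ignore')
    return str(data).encode(encoding, errors='ignore')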
Ricky Wilson
2

In Python 3, we can use urlopen from urllib.request and BeautifulSoup from the bs4 library to fetch the page title.

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.google.com")
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)

Here we are using 'lxml', which is generally the fastest parser that BeautifulSoup supports.

S Habeeb Ullah
0

Using lxml...

Getting it from the page's meta tags, following the Facebook OpenGraph protocol:

import lxml.html

html_doc = lxml.html.parse(some_url)

t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]

or getting the title element itself with .xpath:

t = html_doc.xpath(".//title")[0].text
markling