Extract part of a regex match

Question

I want a regular expression to extract the title from a HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '')

Is there a regular expression to extract just the contents of <title> so I don't have to remove the tags?

wow I can't believe all the responses calling to parse the entire HTML page just to extract a simple title. What overkill! — hoju, Aug 27 '09 at 02:02
Question title says it all - the example given _happens_ to be HTML, but the general problem is ... general. — Phil, May 24 '17 at 23:30

score 361 · Accepted Answer · edited Jun 15 '20 at 06:27

Use ( ) in regexp and group(1) in python to retrieve the captured string (re.search will return None if it doesn't find the result, so don't use group() directly):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)
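
To make the difference between group() and group(1) concrete, here is a minimal sketch. The sample html string is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample input, only for illustration.
html = "<html><head><TITLE>My page</TITLE></head></html>"

match = re.search(r"<title>(.*)</title>", html, re.IGNORECASE)

if match:
    print(match.group(0))  # whole match, tags included: <TITLE>My page</TITLE>
    print(match.group(1))  # only the text captured by ( ): My page

# re.search returns None when nothing matches, so test before calling .group()
no_match = re.search(r"<title>(.*)</title>", "<p>no title</p>", re.IGNORECASE)
print(no_match)  # None
```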

edited Jun 15 '20 at 06:27 by UselesssCat · answered Aug 25 '09 at 10:29 by Krzysztof Krasoń

If you're not doing anything when no title is found, why would it be a bad thing to use group() directly? (you can catch the exception anyway) – tonfa Aug 25 '09 at 10:52
yeah, but most people forget about exceptions, and are really surprised when they see them at runtime :) – Krzysztof Krasoń Aug 25 '09 at 18:30
Don't forget to run `import re` or else you'll get `NameError: name 're' is not defined` – Powers Mar 26 '20 at 18:59
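
The try/except route discussed in these comments might look roughly like the sketch below. The helper name title_or_none is hypothetical; it relies on the fact that calling .group on None raises AttributeError.

```python
import re

def title_or_none(html):
    """Return the title text, or None when there is no <title> element."""
    try:
        # .group(1) raises AttributeError if re.search returned None
        return re.search(r"<title>(.*)</title>", html, re.IGNORECASE).group(1)
    except AttributeError:
        return None

print(title_or_none("<title>hello</title>"))  # hello
print(title_or_none("<p>no title here</p>"))  # None
```

Whether this reads better than an explicit `if` check is a matter of taste; the explicit check avoids hiding unrelated AttributeErrors.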

Xavier Guihot · Answer 2 · 2022-06-28T03:58:53.253

65

Starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result in a variable directly within the if condition and reusing it in the body:

import re

pattern = '<title>(.*)</title>'
text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
    print(title)
# hello

edited Jun 28 '22 at 03:58 · answered Apr 27 '19 at 15:06 by Xavier Guihot

Oh, that's pretty. – EdwardG Apr 11 '22 at 17:16

score 12 · Answer 3 · answered Aug 25 '09 at 10:30

Try using capturing groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

answered Aug 25 '09 at 10:30 by Aaron Maenpaa

score 9 · Answer 4 · answered Mar 01 '13 at 19:22

May I recommend Beautiful Soup? It is a very good library for parsing a whole HTML document.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
titleName = soup.title.string  # soup.title.name would only give the tag name, 'title'

answered Mar 01 '13 at 19:22 by kharagpur

I would like to add, that beautifulsoup also parses incomplete html, and that's really nice. – endre Oct 21 '13 at 07:52

score 7 · Answer 5 · answered Aug 25 '09 at 10:28

Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

answered Aug 25 '09 at 10:28 by Randy

If you really want to use regex for HTML parsing, don't call .group() directly on the result of re.search, since that result may be None. – iElectric Aug 25 '09 at 10:37
You should use `.*?` in case there are multiple `</title>` tags in the document (unlikely, but you never know). – tonfa Aug 25 '09 at 10:41
@iElectric: you could put it in a try except block if you really want, right? – tonfa Aug 25 '09 at 10:45
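
A small sketch of what the `.*?` suggestion changes, using a made-up input with two title elements:

```python
import re

# Made-up input containing two <title> elements.
html = "<title>a</title><title>b</title>"

# Greedy: .* runs to the LAST </title> it can find.
greedy = re.search(r"<title>(.*)</title>", html).group(1)

# Non-greedy: .*? stops at the FIRST </title>.
lazy = re.search(r"<title>(.*?)</title>", html).group(1)

print(greedy)  # a</title><title>b
print(lazy)    # a
```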

score 6 · Answer 6 · edited Mar 05 '20 at 00:49

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

edited Mar 05 '20 at 00:49 by MarredCheese · answered Aug 25 '09 at 10:28 by Vinay Sajip

Jim Dennis · Answer 7 · 2021-10-02T18:49:40.493

I'd think this should suffice:

import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
match = pattern.search(text)
title = match.group(1) if match else None

... assuming that your text (HTML) is in a variable named "text."

This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.

However ...

Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra, redundant work when various HTML, SGML and XML parsers are already in the standard libraries.)

If you're handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.

Another option is lxml, which is written for properly structured (standards-conformant) HTML. It also has an option to fall back to BeautifulSoup as a parser: ElementSoup.
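
For the standard-library route, one possible sketch uses html.parser.HTMLParser. The class name TitleParser and the helper extract_title are made up for this example; it only collects the text of the first title element and is not a substitute for BeautifulSoup or lxml on messy input.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # currently between <title> and </title>
        self.done = False       # first title already fully read
        self.parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

def extract_title(html):
    """Return the first title's text, or None if no complete title was seen."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.done else None

print(extract_title("<html><head><title>My page</title></head><body><p>x</p></body></html>"))
```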

What is `re.MULTILINE` supposed to do here? It changes beginning-of-line `^` and end-of-line `$`, both of which you do not use. — bers, Oct 30 '20 at 08:20
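
To illustrate the point about flags: re.MULTILINE only affects `^` and `$`, while it is re.DOTALL that lets `.` match a newline. A quick sketch with a made-up two-line title:

```python
import re

text = "<title>spans\ntwo lines</title>"

# re.MULTILINE makes no difference here because the pattern has no ^ or $
with_flag = re.search(r"<title>([^<]*)</title>", text, re.MULTILINE | re.IGNORECASE)
without_flag = re.search(r"<title>([^<]*)</title>", text, re.IGNORECASE)
print(with_flag.group(1) == without_flag.group(1))  # True

# [^<] already matches newlines; '.' does not unless re.DOTALL is given
dot_plain = re.search(r"<title>(.*)</title>", text)
dot_all = re.search(r"<title>(.*)</title>", text, re.DOTALL)
print(dot_plain)                  # None
print(repr(dot_all.group(1)))     # 'spans\ntwo lines'
```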

score 4 · Answer 8 · answered Oct 27 '13 at 14:07

The provided pieces of code do not cope with the case where nothing matches (re.search returns None, and calling .group on it raises an exception). May I suggest:

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns the first captured group if the pattern matches, or an empty string by default if it does not.
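
If the getattr one-liner feels too dense, the same behaviour can be spelled out in a small helper. This is only a sketch; the name first_group and its default parameter are invented for the example.

```python
import re

def first_group(pattern, s, default=""):
    """Return capture group 1 of the first match, or default if nothing matches."""
    m = re.search(pattern, s, re.IGNORECASE)
    return m.group(1) if m else default

pattern = r"<title>(.*)</title>"

# The one-liner from the answer, for comparison
one_liner = getattr(re.search(pattern, "<p>nothing</p>", re.IGNORECASE), 'groups', lambda: [""])()[0]

print(repr(first_group(pattern, "<title>hello</title>")))  # 'hello'
print(repr(first_group(pattern, "<p>nothing</p>")))        # ''
print(repr(one_liner))                                      # ''
```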

bers · Answer 9 · 2020-10-30T08:22:49.120

The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title> (which is valid HTML: White space inside XML/HTML tags).

I therefore propose the following improvement:

import re

def search_title(html):
    m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else None

Test cases:

print(search_title("<title   >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))

Output:

with spaces in tags
with newline in tags
first of two titles
with newline
 in title

Ultimately, I agree with the others who recommend an HTML parser, not least because it also handles non-standard use of HTML tags.

score 2 · Answer 10 · answered May 20 '21 at 14:00

I needed something to match package-0.0.1 (name, version) but wanted to reject an invalid version such as 0.0.010.

See regex101 example.

import re

RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')

example = 'hello-0.0.1'

if match := RE_IDENTIFIER.search(example):
    name, version = match.groups()
    print(f'Name:     {name}')
    print(f'Version:  {version}')
else:
    raise ValueError(f'Invalid identifier {example}')

Output:

Name:     hello
Version:  0.0.1
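
To see which inputs the pattern above rejects, here is a short sketch that runs it over a few made-up candidates (the candidate strings are illustrative only):

```python
import re

RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')

candidates = ['hello-0.0.1', 'hello-1.20.3', 'hello-0.0.010', 'hello-01.2.3', 'Hello-1.0.0']

# Map each candidate to its (name, version) groups, or None when rejected
results = {}
for candidate in candidates:
    match = RE_IDENTIFIER.search(candidate)
    results[candidate] = match.groups() if match else None
    print(candidate, '->', results[candidate] or 'rejected')
```

Leading zeros are rejected because each numeric part must be either a lone 0 or start with 1-9, and the uppercase name fails because the pattern is compiled without re.IGNORECASE.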

score 1 · Answer 11 · answered May 18 '21 at 21:48

Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing, and (?<=<title>).+(?=<\/title>) works great. It matches only what's between the two parenthesized lookaround assertions, so you don't have to do the whole group thing.
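
A minimal sketch of the lookaround approach in Python, with a made-up input. One caveat worth knowing: Python's re module requires lookbehind patterns to be fixed-width, so a variant such as `<title\s*>` cannot go inside `(?<=...)`.

```python
import re

html = "<head><title>hello</title></head>"

# The match itself is only the text between the two assertions
match = re.search(r"(?<=<title>).+(?=</title>)", html)
if match:
    print(match.group())  # hello

# A variable-width lookbehind is rejected when the pattern is compiled
try:
    re.compile(r"(?<=<title\s*>).+(?=</title>)")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)  # False
```

As with the capturing-group versions, `.+` is greedy here, so `.+?` would be the safer choice if more than one title could appear.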
