regular expressions: extract text between two markers

Question

I'm trying to write a Python parser to extract some information from HTML pages.

It should extract the text between <p itemprop="xxx"> and </p>.

I use this regular expression:

m = re.search(ur'p>(?P<text>[^<]*)</p>', html)

but it fails when there are other tags between the markers. For example:

<p itemprop="xxx"> some text <br/> another text </p>

As I understand it, [^<] excludes only a single character. How do I write "everything except </p>"?

Use an HTML parser, such as [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/). Regex is not a suitable tool for this kind of parsing. — Robert Harvey, Aug 17 '14 at 21:12
See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Rnhmjoj, Aug 17 '14 at 21:28

enrico.bacis · Accepted Answer · 2014-08-18T20:33:07.507

2

You can use:

m = re.search(ur'p>(?P<text>.*?)</p>', html)

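This is a lazy (non-greedy) match: it matches as few characters as possible, so it stops at the first </p>. Note that . does not match newlines unless you pass re.DOTALL. You should also consider using an HTML parser like BeautifulSoup which, after installation, can be used with CSS selectors like this: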

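from bs4 import BeautifulSoup
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
# select() returns a list of matching tags; call .get_text() on each to get its text
m = soup.select('p[itemprop="xxx"]')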
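
If installing a third-party package is not an option, the standard library's html.parser module can do a similar job. The following is only a minimal sketch, assuming Python 3. The class ItempropTextExtractor and the helper extract_itemprop_text are hypothetical names made up for this illustration, and the sketch keeps only the text nodes of each matching <p>, dropping inner tags such as <br/>.

```python
from html.parser import HTMLParser


class ItempropTextExtractor(HTMLParser):
    """Collect the text inside <p itemprop="..."> elements (hypothetical helper)."""

    def __init__(self, itemprop):
        super().__init__()
        self.itemprop = itemprop
        self.results = []    # one string per matching <p>
        self._chunks = None  # text pieces of the <p> being read, or None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and dict(attrs).get("itemprop") == self.itemprop:
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._chunks is not None:
            self.results.append("".join(self._chunks))
            self._chunks = None

    def handle_data(self, data):
        # Only keep text while we are inside a matching <p>
        if self._chunks is not None:
            self._chunks.append(data)


def extract_itemprop_text(html, itemprop):
    parser = ItempropTextExtractor(itemprop)
    parser.feed(html)
    parser.close()
    return parser.results


html = '<p itemprop="xxx"> some text <br/> another text </p>'
print(extract_itemprop_text(html, "xxx"))
```

Because inner tags are skipped rather than matched, the <br/> from the question no longer gets in the way.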

edited Aug 18 '14 at 20:33

answered Aug 17 '14 at 21:16

A small correction: `.*` is a greedy match. `.*?` is a non-greedy match. You have correctly specified `.*?`, but incorrectly described it. – Robᵩ Aug 18 '14 at 20:29
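
The difference between [^<]*, .* and .*? discussed in this thread can be checked directly. This is a minimal sketch, assuming Python 3 (so a plain r'...' string replaces the Python 2 ur'...' prefix); the sample strings are made up for illustration.

```python
import re

one = '<p> some text <br/> another text </p>'
two = '<p>first <br/> one</p><p>second</p>'
multi = '<p itemprop="xxx">line one\nline two</p>'

# [^<]* cannot cross the "<" of <br/>, so nothing matches.
no_match = re.search(r'<p>(?P<text>[^<]*)</p>', one)

# Lazy .*? stops at the first </p>.
lazy = re.search(r'<p>(?P<text>.*?)</p>', two).group('text')

# Greedy .* runs on to the last </p> and swallows the second paragraph.
greedy = re.search(r'<p>(?P<text>.*)</p>', two).group('text')

# "." does not match a newline unless re.DOTALL is given.
single_line = re.search(r'<p itemprop="xxx">(?P<text>.*?)</p>', multi)
dotall = re.search(r'<p itemprop="xxx">(?P<text>.*?)</p>', multi, re.DOTALL).group('text')

print(no_match)        # None
print(repr(lazy))      # 'first <br/> one'
print(repr(greedy))    # 'first <br/> one</p><p>second'
print(single_line)     # None
print(repr(dotall))    # 'line one\nline two'
```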

score 1 · Answer 2 · answered Aug 18 '14 at 02:53

1) Never use regular expressions to parse HTML.

2) The following regular expression will work some of the time, on some HTML:

#!/usr/bin/python2.7

import re

pattern = ur'''
    (?imsx)             # ignore case, multiline, dot-matches-newline, verbose
    <p.*?>              # match first marker
    (?P<text>.*?)       # non-greedy match anything
    </p.*?>             # match second marker
'''

print re.findall(pattern, '<p>hello</p>')
print re.findall(pattern, '<p>hello</p> and <p>goodbye</p>')
print re.findall(pattern, 'before <p>hello</p> and <p><i>good</i>bye</p> after')
print re.findall(pattern, '<p itemprop="xxx"> some text <br/> another text </p>')

As another answer pointed out, .*? is the non-greedy form of .*: it matches any characters, but as few of them as possible.
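
The script above targets Python 2.7. As a rough sketch of how it might look on Python 3, the version below drops the ur'' prefix and the print statement, and passes the flags as an argument, since newer Python 3 releases reject inline global flags such as (?imsx) that are not at the very start of the pattern. The names PATTERN and samples are introduced here for illustration only. The last sample shows one way the pattern goes wrong: <p.*?> also matches tags that merely start with "p".

```python
import re

PATTERN = re.compile(
    r'''
    <p.*?>              # opening marker, with any attributes
    (?P<text>.*?)       # non-greedy match of the content
    </p.*?>             # closing marker
    ''',
    re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE,
)

samples = [
    '<p>hello</p>',
    '<p>hello</p> and <p>goodbye</p>',
    'before <p>hello</p> and <p><i>good</i>bye</p> after',
    '<p itemprop="xxx"> some text <br/> another text </p>',
    '<pre>code</pre>',  # not a paragraph, but still matched
]

for sample in samples:
    print(PATTERN.findall(sample))

# ['hello']
# ['hello', 'goodbye']
# ['hello', '<i>good</i>bye']
# [' some text <br/> another text ']
# ['code']
```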
