
I'm trying to write a Python parser to extract some information from HTML pages.

It should extract the text between <p itemprop="xxx"> and </p>.

I use this regular expression:

m = re.search(ur'p>(?P<text>[^<]*)</p>', html)

but it fails when there are other tags between them. For example:

<p itemprop="xxx"> some text <br/> another text </p>

As I understand it, [^<] excludes only a single character. How do I write "everything except </p>"?

vitperov

2 Answers


You can use:

m = re.search(ur'p>(?P<text>.*?)</p>', html)

This is a lazy (non-greedy) match: it matches as little as possible, so it stops at the first </p>. You should also consider using an HTML parser such as BeautifulSoup, which, once installed, can be used with CSS selectors like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
m = soup.select('p[itemprop="xxx"]')       # list of matching <p> tags
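
For comparison, here is a minimal Python 3 sketch of the lazy-match approach (the `ur''` prefix is Python 2 only). It assumes the target is exactly the `<p itemprop="xxx">` element from the question, so it anchors on the full opening tag, because `p>` alone does not match an opening tag that carries attributes. The name `P_TEXT`, the `re.DOTALL` flag, and the sample strings are additions for illustration, not part of the original pattern.

```python
import re

# Sample input taken from the question.
html = '<p itemprop="xxx"> some text <br/> another text </p>'

# Lazy match: .*? expands only as far as the first </p>.
# re.DOTALL (an addition here) lets "." match newlines as well.
P_TEXT = re.compile(r'<p itemprop="xxx">(?P<text>.*?)</p>', re.DOTALL)

m = P_TEXT.search(html)
text = m.group('text') if m else None
print(repr(text))

# Lazy vs. greedy on two adjacent paragraphs (hypothetical input).
two = '<p itemprop="xxx">a</p><p itemprop="xxx">b</p>'
lazy = re.findall(r'<p itemprop="xxx">(.*?)</p>', two)
greedy = re.findall(r'<p itemprop="xxx">(.*)</p>', two)
print(lazy)    # one result per paragraph
print(greedy)  # a single result that runs through to the last </p>
```

The captured text still contains the inner `<br/>` markup; a regular expression does not strip nested tags for you.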
enrico.bacis
  • A small correction: `.*` is a greedy match. `.*?` is a non-greedy match. You have correctly specified `.*?`, but incorrectly described it. – Robᵩ Aug 18 '14 at 20:29

1) Never use regular expressions to parse HTML.

2) The following regular expression will work some of the time, on some HTML:

#!/usr/bin/python2.7

import re

pattern = ur'''
    (?imsx)             # ignore case, multiline, dot-matches-newline, verbose
    <p.*?>              # match first marker
    (?P<text>.*?)       # non-greedy match anything
    </p.*?>             # match second marker
'''

print re.findall(pattern, '<p>hello</p>')
print re.findall(pattern, '<p>hello</p> and <p>goodbye</p>')
print re.findall(pattern, 'before <p>hello</p> and <p><i>good</i>bye</p> after')
print re.findall(pattern, '<p itemprop="xxx"> some text <br/> another text </p>')

As the other answer pointed out, .*? is a non-greedy pattern: it matches any character, but as few of them as possible.
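
Point 1 can also be followed without installing anything. The following is a minimal Python 3 sketch, assuming the standard library's `html.parser` module is an acceptable stand-in for a regular expression; the class name `ItempropText` and its attributes are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class ItempropText(HTMLParser):
    """Collect the text inside every <p> whose itemprop matches."""

    def __init__(self, itemprop):
        super().__init__()
        self.itemprop = itemprop
        self.inside = False   # True while inside a matching <p>
        self.chunks = []      # text pieces of the current <p>
        self.results = []     # one string per matching <p>

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and dict(attrs).get('itemprop') == self.itemprop:
            self.inside = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.inside:
            self.inside = False
            self.results.append(''.join(self.chunks))

    def handle_data(self, data):
        # Nested tags such as <br/> produce no data, so only text is kept.
        if self.inside:
            self.chunks.append(data)


parser = ItempropText('xxx')
parser.feed('<p itemprop="xxx"> some text <br/> another text </p>')
parser.close()

# Collapse runs of whitespace for display.
cleaned = [' '.join(r.split()) for r in parser.results]
print(cleaned)
```

Unlike the regular expression, this keeps the text on both sides of the `<br/>` and drops the tag itself.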

Robᵩ