Regular expression to extract info from a web page in Python fails when the text contains line breaks ("returns")

Question

I want to get www.target.com and target. The following code works:

#coding: utf8
import re

info = '''<a href="www.target.com">  xxxxxx   <span>target</span>'''

result = re.findall(r'<a href="(.*?)".+?<span>(.*?)</span>', info)
print(result)

But it fails when the string contains many line breaks ("returns") and other characters, like this:

info = '''<a href="www.target.com"> # return here
xxxxxxxx                            # return here
xxxx                                # return here
xxxxxx   <span>target</span>'''

In this situation, how can I get the link www.target.com and the word target using a regular expression in Python?

You can't -- regular expressions are incapable of parsing HTML. You will need to use an HTML or XML parser library instead. — The Paramagnetic Croissant, Oct 18 '14 at 08:52
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — The Paramagnetic Croissant, Oct 18 '14 at 08:52
Can you come up with *one* good reason for using regular expressions here? Python has a wonderful HTML parser ([BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)), use that instead. — Tomalak, Oct 18 '14 at 09:16
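The parser-based approach suggested in the comments above can be sketched roughly as follows. This is only an illustration under assumptions: it uses the standard library's `html.parser` (Python 3) rather than BeautifulSoup so that it has no dependencies, and the class name `LinkSpanParser` is a hypothetical helper, not something from the thread.

```python
from html.parser import HTMLParser


class LinkSpanParser(HTMLParser):
    """Collect (href, span text) pairs. Hypothetical helper for illustration."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # finished (href, text) tuples
        self._href = None      # href of the most recent <a> tag
        self._in_span = False  # True while inside a <span> that follows an <a>
        self._text = []        # text chunks seen inside the current <span>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "span" and self._href is not None:
            self._in_span = True
            self._text = []

    def handle_data(self, data):
        if self._in_span:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_span:
            self.pairs.append((self._href, "".join(self._text)))
            self._in_span = False
            self._href = None


info = '''<a href="www.target.com">
xxxxxxxx
xxxx
xxxxxx   <span>target</span>'''

parser = LinkSpanParser()
parser.feed(info)
parser.close()
print(parser.pairs)
```

Because the parser works on tags rather than characters, line breaks between the link and the span need no special handling.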

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12


Turn on DOTALL mode (re.DOTALL) so that the dot in your regex also matches newline characters. The Python documentation describes the flag as follows:

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

Code:

result = re.findall(r'<a href="(.*?)".+?<span>(.*?)</span>', info, re.DOTALL)

Example:

>>> import re
>>> info = '''<a href="www.target.com"> # return here
... xxxxxxxx                            # return here
... xxxx                                # return here
... xxxxxx   <span>target</span>'''
>>> re.findall(r'<a href="(.*?)".+?<span>(.*?)</span>', info, re.DOTALL)
[('www.target.com', 'target')]
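To make the effect of the flag easier to see, here is a small side-by-side sketch. It reuses the pattern from the answer on a plain multi-line string; the variable names (`no_flag`, `with_flag`, `inline`) are only illustrative. It also shows the inline `(?s)` form, which should behave the same as passing `re.DOTALL`.

```python
import re

info = '''<a href="www.target.com">
xxxxxxxx
xxxx
xxxxxx   <span>target</span>'''

pattern = r'<a href="(.*?)".+?<span>(.*?)</span>'

# Without a flag, '.' cannot cross a newline, so nothing matches.
no_flag = re.findall(pattern, info)

# re.DOTALL (alias re.S) lets '.' match newline characters too.
with_flag = re.findall(pattern, info, re.DOTALL)

# The inline flag (?s) at the start of the pattern enables the same mode.
inline = re.findall(r'(?s)' + pattern, info)

print(no_flag)
print(with_flag)
print(inline)
```

The inline form can be handy when the pattern is stored as a plain string and the call site cannot pass flags.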


answered Oct 18 '14 at 08:52

Avinash Raj


You really should not recommend/provide solutions that parse HTML with regular expressions. – Tomalak Oct 18 '14 at 09:18

@Tomalak I agree, but some high-rep users have also posted regex solutions to this type of question. See http://stackoverflow.com/a/26428308/3297613 – Avinash Raj Oct 18 '14 at 09:25
Yes, and that is equally questionable. (the downvote is not mine, btw) – Tomalak Oct 18 '14 at 09:30
The answer really solved my problem. Thank you. I read another answer about using `re` to parse HTML: `it's sometimes appropriate to parse a limited, known set of HTML` – liyuhao Oct 18 '14 at 11:25
