Use Python re to get rid of links

Question

Say I have a string that looks like <a href="/wiki/Greater_Boston" title="Greater Boston">Boston–Cambridge–Quincy, MA–NH MSA</a>

How can I use re to get rid of links and get only the Boston–Cambridge–Quincy, MA–NH MSA part?

I tried something like match = re.search(r'<.+>(\w+)<.+>', name_tmp), but it is not working.

score 3 · Accepted Answer · edited May 23 '17 at 11:49

re.sub(r'<a[^>]+>(.*?)</a>', r'\1', text)

Note that parsing HTML with regular expressions is, in general, rather dangerous. However, it seems that you are parsing MediaWiki-generated links, where it is safe to assume that the links are always formatted similarly, so you should be fine with that regular expression.
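As a minimal sketch, here is the substitution above applied to the sample string from the question. The helper name `strip_links` is hypothetical and only for illustration; the pattern is the one from this answer, written as a raw string. The sketch also shows why the attempt in the question returns no match: `\w+` cannot span the spaces, comma, and en dashes in the link text.

```python
import re

# Pattern from the answer: match an opening <a ...> tag, lazily capture
# the link text, then match the closing </a>.
LINK_RE = re.compile(r'<a[^>]+>(.*?)</a>')


def strip_links(text):
    """Replace each <a ...>...</a> with its inner text (hypothetical helper)."""
    return LINK_RE.sub(r'\1', text)


name_tmp = ('<a href="/wiki/Greater_Boston" title="Greater Boston">'
            'Boston–Cambridge–Quincy, MA–NH MSA</a>')

print(strip_links(name_tmp))  # Boston–Cambridge–Quincy, MA–NH MSA

# The attempt from the question finds nothing, because \w+ must reach
# from a '>' directly to a '<' using only word characters.
print(re.search(r'<.+>(\w+)<.+>', name_tmp))  # None
```

Because the capture group is lazy, several links in one string are each replaced by their own text. Link text that spans multiple lines would not match unless `re.DOTALL` is added, which this sketch does not do.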

edited May 23 '17 at 11:49

Community

answered Feb 23 '13 at 23:43

poke

score 3 · Answer 2 · answered Feb 24 '13 at 00:21

You can also use the bleach module (https://pypi.python.org/pypi/bleach), which wraps HTML sanitizing tools and lets you quickly strip HTML from text.

answered Feb 24 '13 at 00:21

Jonathan Vanasco

