Say I have a string that looks like <a href="/wiki/Greater_Boston" title="Greater Boston">Boston–Cambridge–Quincy, MA–NH MSA</a>

How can I use re to get rid of links and get only the Boston–Cambridge–Quincy, MA–NH MSA part?

I tried something like match = re.search(r'<.+>(\w+)<.+>', name_tmp), but it does not work.

clwen

2 Answers

3
re.sub(r'<a[^>]+>(.*?)</a>', r'\1', text)

Note that parsing HTML with regular expressions is fragile in general. However, it seems you are parsing MediaWiki-generated links, where it is safe to assume that the links are always formatted similarly, so you should be fine with this regular expression.
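As a quick check, here is a minimal sketch that runs both the pattern from the question and the substitution above on the sample string. The variable names `attempt` and `stripped` are only illustrative, and the sketch assumes Python 3, where `\w` does not match the en dashes, comma, or spaces in the link text.

```python
import re

# The sample string from the question.
name_tmp = ('<a href="/wiki/Greater_Boston" title="Greater Boston">'
            'Boston–Cambridge–Quincy, MA–NH MSA</a>')

# Pattern from the question: (\w+) has to span everything between the
# opening tag's '>' and the closing tag's '<', but \w cannot cross the
# dashes, comma, or spaces, so the search finds nothing.
attempt = re.search(r'<.+>(\w+)<.+>', name_tmp)

# Substitution from this answer: replace each <a ...>...</a> with the
# text captured by the non-greedy group.
stripped = re.sub(r'<a[^>]+>(.*?)</a>', r'\1', name_tmp)

print(attempt)   # None
print(stripped)  # Boston–Cambridge–Quincy, MA–NH MSA
```

The non-greedy `(.*?)` matters when a string contains several links: a greedy `(.*)` would run from the first opening tag to the last closing tag.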

poke
3

You can also use the bleach module (https://pypi.python.org/pypi/bleach), which wraps HTML sanitizing tools and lets you quickly strip HTML from text.

Jonathan Vanasco