
How can I match the string between `abcd="_blank">` and `</a>` using a regex in Python 2.7? For example, for `abcd="_blank">ABBA</a>` the result should be `ABBA`.

alecxe
TJ1
  • [Please read this answer carefully](http://stackoverflow.com/a/1732454/918959) – Antti Haapala -- Слава Україні Jan 31 '15 at 05:25
  • Addendum: of course, in simple cases it may be possible to use a regular expression to match something between tags, if it is always in the same format in the source code. But the mere fact that you asked shows me that your regex-fu is not strong enough to know when, and especially when not, a regular expression works. None of the answers claiming the contrary are right; the simplest thing that can parse any fragment of HTML 100% correctly is an HTML parser. – Antti Haapala -- Слава Україні Jan 31 '15 at 07:24
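
For the narrow fixed-format case mentioned in the comment above, a regular expression could look like the following minimal sketch. It assumes the source always contains `abcd="_blank">` exactly as written in the question; the names `LINK_TEXT` and `link_texts` are made up for illustration. The second call shows how easily the assumption breaks.

```python
import re

# Assumption: the attribute is written exactly as abcd="_blank" and is the
# last thing before the closing ">" of the opening tag.
LINK_TEXT = re.compile(r'abcd="_blank">(.*?)</a>', re.DOTALL)


def link_texts(source):
    """Return every string found between abcd="_blank"> and </a>."""
    return LINK_TEXT.findall(source)


print(link_texts('abcd="_blank">ABBA</a>'))               # ['ABBA']
print(link_texts('<a abcd="_blank" class="x">ABBA</a>'))  # [] (format changed)
```

The non-greedy `(.*?)` stops at the first `</a>`, so several links in one string are returned separately. Any change in attribute order, quoting or whitespace silently produces no match.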

1 Answer


What about using an HTML parser instead, for example BeautifulSoup:

from bs4 import BeautifulSoup

data = """
<div>
    <a xyz="_blank">NO MATCH 1</a>
    <a abcd="_blank">ABBA</a>
    <a>NO MATCH 2</a>
</div>
"""

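# Name the parser explicitly; without it, current bs4 versions pick one and warn.
soup = BeautifulSoup(data, "html.parser")

# Find every <a> tag whose abcd attribute equals "_blank" and print its text.
for a in soup.find_all('a', abcd='_blank'):
    print(a.text)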

Prints ABBA.

alecxe
  • Wow, this works very well. Can you please explain how it works? – TJ1 Jan 31 '15 at 05:32
  • @TJ1 as Antti already mentioned, please read the information about why you should not use regular expressions for parsing HTML. `BeautifulSoup` is a specialized tool for parsing HTML and is very easy to use and understand. Here we find all `a` tags that have an `abcd` attribute equal to `_blank`; for each tag found, we get the tag's text. Hope this makes sense. – alecxe Jan 31 '15 at 05:38
  • Notice also that an HTML parser gives you the right result even if the text contains a `<` character escaped as `&lt;`, or nested tags. – Antti Haapala -- Слава Україні Jan 31 '15 at 07:25
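
To illustrate the last comment without installing anything, here is a rough sketch that uses the standard library's `html.parser` in place of BeautifulSoup (a swapped-in tool, not the one from the answer). It is written for Python 3, and the class `AnchorTextCollector` and the function `extract_link_text` are hypothetical names. The sample link text contains both an escaped `&lt;` and a nested `<b>` tag.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the text of <a> tags whose abcd attribute equals "_blank"."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities such as &lt;
        super().__init__(convert_charrefs=True)
        self.results = []    # finished link texts
        self._chunks = None  # text pieces of the <a> being read, else None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a" and dict(attrs).get("abcd") == "_blank":
            self._chunks = []

    def handle_data(self, data):
        # Text inside nested tags arrives here too, so it is kept as well
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._chunks is not None:
            self.results.append("".join(self._chunks))
            self._chunks = None


def extract_link_text(html):
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    return parser.results


sample = '<a abcd="_blank">A &lt; <b>B</b>BA</a><a xyz="_blank">NO MATCH</a>'
print(extract_link_text(sample))  # ['A < BBA']
```

The entity is decoded to a literal `<` and the text inside `<b>` is joined with the rest. A pattern anchored on the first `<` or `>` would cut the text short at the nested tag.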