Find string between two sets of string (python / urllib2 / beautiful soup)

Question

I have the following source code of a web page that I am trying to parse data from:

<span class="reviewCount">
<a href="...Reviews-WHATIWANT-City..." target="_blank" onclick="XX;">1,361 reviews</a>
</span>

EDIT (with beautiful soup):

To extract this information I parse the data using beautiful soup. I use the following code:

spans = soup.findAll('span', attrs={"class":u"reviewCount"})
for span in spans:
    a = span.find('a')
    print re.search('(?<=Reviews-)(.*?)(?=-City)', a.get('href'))

but I get this output:

<_sre.SRE_Match object at 0x7f84fce05300>
<_sre.SRE_Match object at 0x7f84fce05300>
<_sre.SRE_Match object at 0x7f84fce05300>
<_sre.SRE_Match object at 0x7f84fce05300>

and not the text between "Reviews-" and "-City". Could anyone assist me in finding the right syntax? Thanks.

Why not use [BeautifulSoup](https://stackoverflow.com/questions/11709079/parsing-html-python) to parse HTML instead of trying to use regex, which is [widely considered a bad idea](https://blog.codinghorror.com/parsing-html-the-cthulhu-way/) — Cory Kramer, Jan 07 '16 at 16:31
I will look at it thanks. I will be happy to know how to do it with Regex anyway though. — Bastien, Jan 07 '16 at 16:39
You could just avoid regexes altogether and use the Python `HTMLParser` library. (It's called `html.parser` in Python 3.) — ddsnowboard, Jan 07 '16 at 16:54
`re.findall(r'(?s)reviewCount(.*?)/a',str(data))` - but only if you always have both `reviewCount` and `/a` and they are not separated with tags and they are unique. And the substring in between them is not too long. Too many ifs to use a regex with HTML. — Wiktor Stribiżew, Jan 07 '16 at 17:28
Thanks. I have used beautiful soup and edited the question accordingly. I manage to retrieve span, a and the href but I still struggle when trying to get bytes of the href element. — Bastien, Jan 07 '16 at 21:41

score 0 · Accepted Answer · answered Jan 08 '16 at 17:24

re.search() returns a "match" object, not the matched text. You need to read the capturing group's value if there is a match:

spans = soup.find_all('span', attrs={"class":u"reviewCount"})
for span in spans:
    a = span.find('a')
    match = re.search(r'Reviews\-(.*?)\-City', a.get('href'))
    if match:
        print(match.group(1))
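
To see the difference between the match object and the captured text without needing the page itself, here is a minimal standalone sketch. It assumes the href looks like the one in the question (elisions kept as-is); the helper name `extract_between` and the `None` guard are illustrative additions, not part of the original code.

```python
import re

# Same pattern as above; the hyphens do not need escaping outside a character class.
PATTERN = re.compile(r'Reviews-(.*?)-City')

def extract_between(href):
    """Return the text between 'Reviews-' and '-City', or None if it is absent.

    Hypothetical helper for illustration only.
    """
    if not href:
        # a.get('href') returns None when the attribute is missing,
        # and re.search() raises TypeError on None.
        return None
    match = PATTERN.search(href)
    # match is a match object (or None); group(1) is the captured substring.
    return match.group(1) if match else None

sample_href = "...Reviews-WHATIWANT-City..."  # href copied from the question
print(extract_between(sample_href))
print(extract_between("no match here"))
```

Inside the loop this would be called as `extract_between(a.get('href'))`. Note that `span.find('a')` can also return None when a span has no link, so that case may need its own check.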
