I have the following source code of a web page that I am trying to parse data from:

<span class="reviewCount">
<a href="...Reviews-WHATIWANT-City..." target="_blank" onclick="XX;">1,361 reviews</a>
</span>

EDIT (with Beautiful Soup):

To extract this information I parse the page with Beautiful Soup, using the following code:

spans = soup.findAll('span', attrs={"class":u"reviewCount"})
for span in spans:
    a = span.find('a')
    print re.search('(?<=Reviews-)(.*?)(?=-City)', a.get('href'))

but I get this output:

<_sre.SRE_Match object at 0x7f84fce05300>
<_sre.SRE_Match object at 0x7f84fce05300>
<_sre.SRE_Match object at 0x7f84fce05300>
<_sre.SRE_Match object at 0x7f84fce05300>

and not the text between "Reviews-" and "-City". Could anyone help me find the right syntax? Thanks.

Bastien
  • Why not use [BeautifulSoup](https://stackoverflow.com/questions/11709079/parsing-html-python) to parse HTML instead of trying to use regex, which is [widely considered a bad idea](https://blog.codinghorror.com/parsing-html-the-cthulhu-way/)? – Cory Kramer Jan 07 '16 at 16:31
  • I will look at it thanks. I will be happy to know how to do it with Regex anyway though. – Bastien Jan 07 '16 at 16:39
  • You could just avoid regexes altogether and use the Python `HTMLParser` library. (It's called `html.parser` in Python 3.) – ddsnowboard Jan 07 '16 at 16:54
  • `re.findall(r'(?s)reviewCount(.*?)/a',str(data))` - but only if you always have both `reviewCount` and `/a` and they are not separated with tags and they are unique. And the substring in between them is not too long. Too many ifs to use a regex with HTML. – Wiktor Stribiżew Jan 07 '16 at 17:28
  • Thanks. I have used beautiful soup and edited the question accordingly. I manage to retrieve span, a and the href but I still struggle when trying to get bytes of the href element. – Bastien Jan 07 '16 at 21:41

1 Answer

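re.search() returns a "match" object, not the matched text, which is why printing it shows <_sre.SRE_Match object ...>. You need to read the capturing group's value with .group(1), after checking that there is a match: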

spans = soup.find_all('span', attrs={"class":u"reviewCount"})
for span in spans:
    a = span.find('a')
    match = re.search(r'Reviews\-(.*?)\-City', a.get('href'))
    if match:
        print(match.group(1))
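
To try this without fetching a page, here is a minimal standalone sketch that runs on the placeholder href from the question. The helper name extract_between is hypothetical, introduced only for illustration, and the href string is the elided one shown in the question, not a real URL.

```python
import re

# One capturing group for the text between "Reviews-" and "-City".
PATTERN = re.compile(r'Reviews-(.*?)-City')


def extract_between(href):
    """Return the text between 'Reviews-' and '-City', or None if absent."""
    match = PATTERN.search(href)
    # search() returns None when nothing matches, so check before .group().
    return match.group(1) if match else None


# Placeholder href copied from the question; real URLs will differ.
href = "...Reviews-WHATIWANT-City..."
print(extract_between(href))

# The lookaround pattern from the question also works. Lookarounds consume
# no characters, so the whole match (group 0) is already the wanted text.
match = re.search(r'(?<=Reviews-)(.*?)(?=-City)', href)
if match:
    print(match.group(0))
```

Either pattern does the job; the original code was only missing the call to .group() on the returned match object.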
alecxe