
How can I extract the upvote (215) and downvote (82) counts from the following HTML snippet using a Python regular expression?

<span class="vote-actions">
    <a class="btn btn-default vote-action-good">
        <span class="icon thumb-up black black-hover">&nbsp;</span>
        <span class="rating-inbtn">215</span>
    </a>
    <a class="btn btn-default vote-action-bad">
        <span class="icon thumb-down grey black-hover">&nbsp;</span>
        <span class="rating-inbtn">82</span>
    </a>
</span>

I have formatted the HTML for readability, but the original code contains no '\n' or '\t' characters.

FYI, I am not looking for a Beautiful Soup solution. I am looking for one that uses Python's re.search function.

martineau
coolsaint

2 Answers


To find both numbers, I would do:

text = '''<span class="vote-actions">
    <a class="btn btn-default vote-action-good">
        <span class="icon thumb-up black black-hover">&nbsp;</span>
        <span class="rating-inbtn">215</span>
    </a>
    <a class="btn btn-default vote-action-bad">
        <span class="icon thumb-down grey black-hover">&nbsp;</span>
        <span class="rating-inbtn">82</span>
    </a>
</span>'''

import re

a = re.findall(r'rating-inbtn">(\d+)', text)
print(a)

['215', '82']

In the HTML, the first number is the upvote count and the second is the downvote count, so I don't need a more elaborate method.

up = a[0]
down = a[1]
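
Since the question asks specifically for re.search, here is a minimal sketch that ties each number to its own link class instead of relying on order. The helper name vote_count is hypothetical, and the HTML is written as one line on the assumption (from the question) that the real page has no '\n' or '\t'.

```python
import re

# The snippet on a single line, as the question says the real page has it.
html = (
    '<span class="vote-actions">'
    '<a class="btn btn-default vote-action-good">'
    '<span class="icon thumb-up black black-hover">&nbsp;</span>'
    '<span class="rating-inbtn">215</span></a>'
    '<a class="btn btn-default vote-action-bad">'
    '<span class="icon thumb-down grey black-hover">&nbsp;</span>'
    '<span class="rating-inbtn">82</span></a>'
    '</span>'
)


def vote_count(kind, text):
    """Return the number inside the vote-action-<kind> link, or None."""
    # Lazy .*? stops at the first rating-inbtn span after the link's class.
    # re.DOTALL lets the same pattern work on pretty-printed HTML too.
    pattern = r'vote-action-%s".*?rating-inbtn">(\d+)' % re.escape(kind)
    match = re.search(pattern, text, re.DOTALL)
    return int(match.group(1)) if match else None


up = vote_count("good", html)
down = vote_count("bad", html)
print(up, down)
```

One caveat: if the "good" link ever had no rating-inbtn span, the lazy match would run on into the "bad" link and return its number, so this is only as reliable as the page structure.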

If that is not enough, I would use an HTML parser:

text = '''<span class="vote-actions">
    <a class="btn btn-default vote-action-good">
        <span class="icon thumb-up black black-hover">&nbsp;</span>
        <span class="rating-inbtn">215</span>
    </a>
    <a class="btn btn-default vote-action-bad">
        <span class="icon thumb-down grey black-hover">&nbsp;</span>
        <span class="rating-inbtn">82</span>
    </a>
</span>'''

import lxml.html

soup = lxml.html.fromstring(text)

up = soup.xpath('//a[@class="btn btn-default vote-action-good"]/span[@class="rating-inbtn"]')
up = up[0].text
print(up)

down = soup.xpath('//a[@class="btn btn-default vote-action-bad"]/span[@class="rating-inbtn"]')
down = down[0].text
print(down)
furas
  • Thanks. I was actually trying to modify an extractor in the youtube-dl package, which has two helper functions: _html_search_regex and _search_regex. Both perform a regex search on the given string, using a single pattern or a list of patterns, and return the first matching group. I believe I can integrate the findall function too. I liked the trick. – coolsaint Apr 07 '19 at 00:49
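
The behaviour described in this comment (try one pattern or a list of patterns, return the first matching group) can be sketched with plain re.search. The function search_first_group below is a hypothetical stand-in written for illustration. It is not youtube-dl's actual helper and only models what the comment describes.

```python
import re


def search_first_group(patterns, text, default=None):
    """Try each pattern in turn and return group 1 of the first match.

    Hypothetical helper: every pattern is assumed to contain at least
    one capturing group.
    """
    if isinstance(patterns, str):
        patterns = [patterns]  # accept a single pattern as well as a list
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return default


# One-line version of the snippet from the question.
html = (
    '<a class="btn btn-default vote-action-good">'
    '<span class="rating-inbtn">215</span></a>'
    '<a class="btn btn-default vote-action-bad">'
    '<span class="rating-inbtn">82</span></a>'
)

up = search_first_group(r'vote-action-good".*?rating-inbtn">(\d+)', html)
# The first pattern matches nothing, so the second one is used.
down = search_first_group(
    [r'no-such-class">(\d+)', r'vote-action-bad".*?rating-inbtn">(\d+)'],
    html,
)
print(up, down)
```

Unlike findall, this returns a single string per call, so the upvote and downvote counts each need their own pattern.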

Don't use regex to parse HTML: https://stackoverflow.com/a/1732454/412529

Here's how to do it with BeautifulSoup:

html = '''<span class="vote-actions">...'''
import bs4
soup = bs4.BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
soup.select("a.vote-action-good span.rating-inbtn")[0].text  # '215'
soup.select("a.vote-action-bad span.rating-inbtn")[0].text  # '82'
jnnnnn