
I am trying to scrape my internet provider's data usage page. I looked for an API, but they don't have one, so I am resorting to scraping the HTML, which looks like this:

</tr><tr class="top-border"><td>17&nbsp;&nbsp;Monday</td><td class='text-right'><span class='mb'>2,991.69&nbsp;MB</span><span class='gb'>2.92&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>1,232.04&nbsp;MB</span><span class='gb'>1.20&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>4,223.73&nbsp;MB</span><span class='gb'>4.12&nbsp;GB</span></td>         <td>
            <div class="progress"><div class="bar bar-success" style="width: 51%;"></div></div>         </td>

        </tr><tr><td>18&nbsp;&nbsp;Tuesday</td><td class='text-right'><span class='mb'>3,589.42&nbsp;MB</span><span class='gb'>3.51&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>1,199.58&nbsp;MB</span><span class='gb'>1.17&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>4,789.00&nbsp;MB</span><span class='gb'>4.68&nbsp;GB</span></td>           <td>
            <div class="progress"><div class="bar bar-success" style="width: 57%;"></div></div>         </td>

etc.

I tried to use Python's re.search, but I can only get a bit of info out of it, e.g.:

search = re.search("class='gb'>(.*)&nbsp;GB</span>",raw_info)
for i in range(0,100):
    try:
        print(search.group(i))
    except:
        break

output:

class='gb'>6.88&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>
1,295.90&nbsp;MB</span><span class='gb'>1.27&nbsp;GB</span></td></td><td class='
text-right'><span class='mb'>8,340.12&nbsp;MB</span><span class='gb'>8.14&nbsp;G
B</span>
6.88&nbsp;GB</span></td></td><td class='text-right'><span class='mb'>1,295.90&nb
sp;MB</span><span class='gb'>1.27&nbsp;GB</span></td></td><td class='text-right'
><span class='mb'>8,340.12&nbsp;MB</span><span class='gb'>8.14

I learned that I can't use groups like that to print out all of the numbers.

tl;dr: I need to extract all the numbers referring to GB and print them like this:

2.92,1.20,4.12

3.51,1.17,4.68

John Smith

Word of advice: never use regex on HTML. See [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) answer. – Wondercricket Oct 28 '16 at 20:46

1 Answer


You might want to try BeautifulSoup, a very flexible library that can do exactly what you are looking for.

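from bs4 import BeautifulSoup

html = scraped  # the page source you have already downloaded
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
# find_all is the current spelling of findAll
spans = soup.find_all('span', attrs={'class': 'gb'})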

You will then have a list of all the span tags that have the gb class. Extracting the numbers, converting them to floats, and printing them in whatever format you want is then fairly simple.
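For the last step, something along these lines should work. It is only a sketch: the sample HTML is two rows trimmed from the question, the helper names gb_rows and format_row are made up for illustration, and it assumes every data row is a tr whose gb spans hold text like "2.92&nbsp;GB". Walking the table row by row (rather than taking one flat list of spans) keeps the three values of each day together.

```python
from bs4 import BeautifulSoup

# Two rows trimmed from the markup in the question (illustrative sample only).
html = """
<table>
<tr class="top-border"><td>17&nbsp;&nbsp;Monday</td>
<td class='text-right'><span class='mb'>2,991.69&nbsp;MB</span><span class='gb'>2.92&nbsp;GB</span></td></td>
<td class='text-right'><span class='mb'>1,232.04&nbsp;MB</span><span class='gb'>1.20&nbsp;GB</span></td></td>
<td class='text-right'><span class='mb'>4,223.73&nbsp;MB</span><span class='gb'>4.12&nbsp;GB</span></td></tr>
<tr><td>18&nbsp;&nbsp;Tuesday</td>
<td class='text-right'><span class='mb'>3,589.42&nbsp;MB</span><span class='gb'>3.51&nbsp;GB</span></td></td>
<td class='text-right'><span class='mb'>1,199.58&nbsp;MB</span><span class='gb'>1.17&nbsp;GB</span></td></td>
<td class='text-right'><span class='mb'>4,789.00&nbsp;MB</span><span class='gb'>4.68&nbsp;GB</span></td></tr>
</table>
"""


def gb_rows(html):
    """Return one list of floats per table row that contains gb spans."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        spans = tr.find_all("span", class_="gb")
        if not spans:
            continue  # header or spacer row without usage figures
        values = []
        for span in spans:
            # &nbsp; is parsed as "\xa0"; normalise it, then keep the number.
            text = span.get_text().replace("\xa0", " ")
            number = text.split()[0].replace(",", "")
            values.append(float(number))
        rows.append(values)
    return rows


def format_row(values):
    """Join the values with commas, keeping two decimals (1.2 -> 1.20)."""
    return ",".join("{:.2f}".format(v) for v in values)


for row in gb_rows(html):
    print(format_row(row))
```

If the real page has other rows containing gb spans (a totals row, for example), those would be printed too, so you may need to filter them out.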

Ziyad Edher