How do I get numerical data while web scraping?

Question

I'm completely new to web scraping, so any reference sites would be great. I'm slightly confused about how to get the actual data. When I print(theText), I get a bunch of HTML code (which should be correct). How exactly do I go about getting values from this? Do I have to use regular expressions to get the actual numerical data?

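import urllib.request

def getData():
    # Build the request and download the raw bytes of the page
    request = urllib.request.Request("http://www.weather.com/weather/5day/l/USGA0028:1:US")
    response = urllib.request.urlopen(request)
    the_page = response.read()
    # Decode the bytes into a string (UTF-8 by default)
    theText = the_page.decode()
    print(theText)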

score 5 · Accepted Answer · answered Jun 26 '15 at 22:05

Have a look at BeautifulSoup. It lets you get elements by their IDs or tags, and it is very useful for basic scraping.
You can just pass the response text (the HTML page) to BeautifulSoup and then call its methods on the result.
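
As a rough illustration of that idea, here is a minimal sketch. It assumes the third-party bs4 package is installed, and the HTML fragment, the id "current-temp" and the class "high" are made up for the example; the real page will use different markup, so inspect it first.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page (theText in the question).
sample_html = """
<html><body>
  <span id="current-temp">85</span>
  <span class="high">91</span>
  <span class="high">88</span>
</body></html>
"""

# Parse the text with Python's built-in HTML parser backend.
soup = BeautifulSoup(sample_html, "html.parser")

# Look up a single element by its id and convert its text to a number.
current = int(soup.find(id="current-temp").get_text())

# Collect every <span class="high"> and convert each one.
highs = [int(tag.get_text()) for tag in soup.find_all("span", class_="high")]

print(current, highs)
```

With the real page you would pass theText instead of sample_html and adjust the id and class names to whatever the site actually uses.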

answered Jun 26 '15 at 22:05

Lawrence Benson


Thank you for the website! I am, however, doing a homework assignment which requires the use of regex. This is the reason why I'm having a lot of trouble finding a website to explain the basics. – Shan Jun 26 '15 at 23:00

This should help for Python: https://docs.python.org/2/library/re.html and this one covers regular expressions in general: http://regexone.com/ – Lawrence Benson Jun 26 '15 at 23:05
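
If the assignment requires regex, something along these lines may be a starting point. This is only a sketch: the HTML fragment and the assumption that each temperature is followed by "&deg;" are invented for illustration, so the pattern will need adapting to what the page really contains.

```python
import re

# Hypothetical HTML fragment; the real weather.com markup will differ.
sample_html = """
<div class="day"><span class="temp">85&deg;</span><span class="low">-3&deg;</span></div>
<div class="day"><span class="temp">91.5&deg;</span></div>
"""

# Capture an optionally negative integer or decimal that sits between
# a tag's closing '>' and the '&deg;' entity.
pattern = re.compile(r">(-?\d+(?:\.\d+)?)&deg;")

# findall returns the captured group for every match, as strings.
temperatures = [float(value) for value in pattern.findall(sample_html)]

print(temperatures)
```

In getData() you could run pattern.findall(theText) in place of the sample string. Anchoring the pattern on nearby text such as "&deg;" helps avoid picking up unrelated numbers elsewhere in the page.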

score 0 · Answer 2 · edited May 23 '17 at 10:24

No, you shouldn't use regular expressions to parse HTML. Instead, have a look at BeautifulSoup4.

edited May 23 '17 at 10:24

Community


answered Jun 26 '15 at 22:06

plasmid0h


Thanks! I am, however, doing a homework assignment which requires the use of regex. This is the reason why I'm having a lot of trouble finding a website to explain the basics. – Shan Jun 26 '15 at 23:00
