Extract data from website using Python

Question

I recently started learning Python, and one of my first projects was to scrape updates from my son's classroom web page and send me a notification whenever the site was updated. That turned out to be easy, so I wanted to expand on it and write a script that automatically checks whether any of our lotto numbers hit. Unfortunately, I haven't been able to figure out how to get the data from the website. Here is one of my attempts from last night.

from bs4 import BeautifulSoup
import urllib.request

webpage = "http://www.masslottery.com/games/lottery/large-winningnumbers.html"

websource = urllib.request.urlopen(webpage)
soup = BeautifulSoup(websource.read(), "html.parser")

span = soup.find("span", {"id": "winning_num_0"})
print(span)

Output is here...
<span id="winning_num_0"></span>

The output listed above is also what I see if I "view source" in a web browser. When I use "Inspect Element" in the browser, I can see the winning numbers in the inspector panel. Unfortunately, I'm not sure how or where the browser is getting the data. Is it loading from another page, or from a script running in the background? I thought the following tutorial would help, but I wasn't able to get the data using similar commands.

http://zevross.com/blog/2014/05/16/using-the-python-library-beautifulsoup-to-extract-data-from-a-webpage-applied-to-world-cup-rankings/

Any help is appreciated. Thanks

if the content is dynamic, you might need an approach based on, e.g., Selenium - http://selenium-python.readthedocs.io/api.html — ewcz, Sep 15 '16 at 12:21
Possible duplicate of [Reading dynamically generated web pages using python](http://stackoverflow.com/questions/13960567/reading-dynamically-generated-web-pages-using-python) — Sandeep, Sep 15 '16 at 12:24
Checking from the developer console what that page does, it loads the data dynamically from here: http://www.masslottery.com/data/json/games/lottery/recent.json So you could just write a script that loads that json-formatted data and checks the numbers from there. A lot easier than scraping html ;) — lari, Sep 15 '16 at 12:25
Selenium is definitely the approach that I would recommend in most cases, but you're lucky here - the static approach is actually even *easier* than what you were trying to do in the first place :) — Wayne Werner, Sep 15 '16 at 12:36
Thanks for the quick replies. I will try both the static and dynamic approach since this is more of a learning project. — gameoverman, Sep 15 '16 at 12:55

score 2 · Accepted Answer · answered Sep 15 '16 at 12:34

If you look closely at the source of the page (I just used curl) you can see this block

<script type="text/javascript">
    // <![CDATA[
    var dataPath = '../../';
    var json_filename = 'data/json/games/lottery/recent.json';
    var games = new Array();
    var sessions = new Array();
    // ]]>
</script>

That recent.json stuck out like a sore thumb (I actually missed the dataPath part at first).
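As a rough sketch of how those two variables fit together: assuming the page's script simply concatenates dataPath and json_filename and resolves the result relative to the page's own URL (an assumption, though it agrees with the URL lari found in the developer console), the standard library's urljoin reproduces the lookup. The Python variable names below are made up for illustration.

```python
from urllib.parse import urljoin

# URL of the page that was being scraped in the question.
page_url = "http://www.masslottery.com/games/lottery/large-winningnumbers.html"

# Values copied from the <script> block above.
data_path = "../../"
json_filename = "data/json/games/lottery/recent.json"

# Assumption: the browser resolves dataPath + json_filename
# relative to the page URL, the way any relative link is resolved.
json_url = urljoin(page_url, data_path + json_filename)
print(json_url)
```

The two "../" segments climb out of /games/lottery/, so the JSON file ends up at the root of the site rather than next to the HTML page.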

After giving that a try, I came up with this:

curl http://www.masslottery.com/data/json/games/lottery/recent.json

Which, as lari points out in the comments, is way easier than scraping HTML. This easy, in fact:

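import json
import urllib.request
from pprint import pprint

# Fetch the JSON file that the page loads in the background.
url = 'http://www.masslottery.com/data/json/games/lottery/recent.json'
websource = urllib.request.urlopen(url)

# The response body is bytes: decode it, then parse it as JSON.
data = json.loads(websource.read().decode('utf-8'))
pprint(data)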

data is now a dict, and you can do whatever kind of dict-like things you'd like to do with it. And good luck ;)
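Once the winning numbers are out of the dict, checking a ticket is a set intersection. The sketch below does not use the real layout of recent.json (check the pprint output for that): sample_draw, its "winning_num" key, the dash-separated format, and count_matches are all hypothetical stand-ins.

```python
def count_matches(my_numbers, winning_numbers):
    """Return how many of my numbers appear among the winning numbers."""
    return len(set(my_numbers) & set(winning_numbers))


# Hypothetical draw record; the real JSON may be shaped differently.
sample_draw = {"game": "Example Game", "winning_num": "03-11-24-31-40"}

# Turn the dash-separated string into a list of integers.
winning = [int(n) for n in sample_draw["winning_num"].split("-")]

my_ticket = [3, 12, 24, 31, 45]
matches = count_matches(my_ticket, winning)
print(f"{matches} of {len(my_ticket)} numbers matched")
```

Sets ignore order and duplicates, which suits games where only the combination matters; a game with a separate bonus ball would need that number compared on its own.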

Wayne Werner

Thank you. I will try this tonight! – gameoverman Sep 15 '16 at 12:54
For added fun, you could always use python's `random` module to guess lotto numbers and see how much money it would make you. – Wayne Werner Sep 15 '16 at 14:32
Your solution worked. Now I need to figure out how to easily extract the information from the dictionary since it is multi-level. – gameoverman Sep 16 '16 at 12:48
If it worked, then you should mark this as accepted by clicking the green checkmark to the left <---. For multi-level dictionaries you can simply chain `[]`s, e.g. `data['foo']['bar']` – Wayne Werner Sep 16 '16 at 13:52
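
To make the chaining concrete, here is a minimal sketch on a made-up nested structure. The keys "games", "draw", and "numbers" are hypothetical, not the real fields of recent.json, and dig is an illustrative helper rather than anything from the standard library.

```python
# Hypothetical nested data: dicts and lists mixed, as JSON often is.
sample = {
    "games": [
        {"name": "Example Game",
         "draw": {"date": "2016-09-14", "numbers": [3, 11, 24]}},
    ]
}

# Chain [] lookups: string keys for dicts, integer indexes for lists.
first_numbers = sample["games"][0]["draw"]["numbers"]
print(first_numbers)


def dig(obj, *keys, default=None):
    """Follow keys/indexes one level at a time; return default if any is missing."""
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


print(dig(sample, "games", 0, "draw", "date"))
print(dig(sample, "games", 5, "name"))  # no sixth game, so the default comes back
```

Plain chaining raises KeyError or IndexError as soon as a level is missing, which is usually what you want while exploring; a helper like dig is only worth it if the feed sometimes omits fields.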
