0

I'm trying to scrape some game files off a chess site in Python and I've run into a problem. My plan is to pull all the game ids out of the HTML and plug them into a URL to download the games. The hard part is actually getting the game ids.

The relevant html looks something like this:

<a class="games right-4" href="/livechess/game?id=1012106017"> View</a>
<a class="games right-4" href="/livechess/game?id=982464559"> View</a>
<a class="games right-4" href="/livechess/game?id=1011988271"> View</a>

I'm interested in the id=... part. Also, there are no other occurrences beginning with /livechess/... in the page.

How can I extract these ids using regular expressions? I've tried reading up on this, but it's confusing me more than it's helping.

Martijn Pieters
walela

2 Answers

2

Don't use a regular expression to parse HTML. Use an HTML parser instead. With BeautifulSoup this task is as easy as:

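# Select <a> tags whose href starts with the game-link prefix.
# The attribute value is quoted because it contains '/', '?' and '='.
for link in soup.select('a[href^="/livechess/game?id="]'):
    print(link['href'])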

Getting just the id from each link can be done with string splitting:

link_id = link['href'].partition('id=')[-1]
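The `partition` call above works because `id` is the only query parameter in these links. If a link ever carried extra parameters, parsing the query string would be more forgiving. The following is a minimal sketch using only the standard library; the helper name `extract_game_id` is made up for illustration and is not part of BeautifulSoup or the site.

```python
from urllib.parse import urlparse, parse_qs


def extract_game_id(href):
    """Return the value of the `id` query parameter, or None if it is absent.

    `extract_game_id` is a hypothetical helper name used for this sketch.
    """
    query = urlparse(href).query          # e.g. 'id=1012106017'
    values = parse_qs(query).get('id')    # e.g. ['1012106017'], or None
    return values[0] if values else None


print(extract_game_id('/livechess/game?id=1012106017'))  # prints 1012106017
```

Inside the loop it would be called as `extract_game_id(link['href'])`.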

Demo with a live page:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.chess.com/members/view/MagnusCarlsen')
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> for link in soup.select('a[href^="/livechess/game?id="]'):
...     print(link['href'])
... 
/livechess/game?id=998801933
/livechess/game?id=998801191
/livechess/game?id=998801076
/livechess/game?id=998801451
/livechess/game?id=998801336
/livechess/game?id=998801799
/livechess/game?id=998801568
/livechess/game?id=998800852
/livechess/game?id=998802049
/livechess/game?id=998800982
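Since the live page may change or go away, here is an offline sketch of the same idea run against the HTML from the question. It deliberately swaps BeautifulSoup for the standard library's `html.parser`, so it runs without any third-party install; the class name `GameLinkParser` is an assumption made for this example, and BeautifulSoup remains the more convenient tool.

```python
from html.parser import HTMLParser


class GameLinkParser(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix.

    Hypothetical helper class; a stand-in for the BeautifulSoup selector.
    """

    def __init__(self, prefix='/livechess/game?id='):
        super().__init__()
        self.prefix = prefix
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and href.startswith(self.prefix):
            self.hrefs.append(href)


# Sample markup copied from the question.
html = '''
<a class="games right-4" href="/livechess/game?id=1012106017"> View</a>
<a class="games right-4" href="/livechess/game?id=982464559"> View</a>
<a class="games right-4" href="/livechess/game?id=1011988271"> View</a>
'''

parser = GameLinkParser()
parser.feed(html)
for href in parser.hrefs:
    print(href)
```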
Martijn Pieters
  • Thanks, Martijn! How would I use BeautifulSoup to get the HTML markup? Initially I had used urllib to open the URL and read the HTML. – walela Dec 28 '14 at 12:28
  • @walela: I added a demo using [`requests`](http://docs.python-requests.org/en/latest/). – Martijn Pieters Dec 28 '14 at 12:28
  • @walela: also see [retrieve links from web page using python and beautiful soup](http://stackoverflow.com/q/1080411) ([my answer there](http://stackoverflow.com/a/22583436) covers BeautifulSoup 4). – Martijn Pieters Dec 28 '14 at 12:30
0

A combination of regex and BeautifulSoup.

In [14]: for i in soup.find_all('a', href=re.compile(r"^/livechess/game\?id=")):
    ...:         print(re.split(r'id=', i['href'])[1])
    ...:     
998801933
998801191
998801076
998801451
998801336
998801799
998801568
998800852
998802049
998800982
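For completeness, the question asked for a regex-only approach. A rough sketch is shown below, run against the HTML from the question. It relies on the asker's statement that nothing else on the page starts with `/livechess/`, and it is more brittle than an HTML parser if the markup changes.

```python
import re

# Sample markup copied from the question.
html = '''
<a class="games right-4" href="/livechess/game?id=1012106017"> View</a>
<a class="games right-4" href="/livechess/game?id=982464559"> View</a>
<a class="games right-4" href="/livechess/game?id=1011988271"> View</a>
'''

# Capture the run of digits that follows the literal '/livechess/game?id='.
# The '?' is escaped because it is a regex metacharacter.
game_ids = re.findall(r'/livechess/game\?id=(\d+)', html)
print(game_ids)  # prints ['1012106017', '982464559', '1011988271']
```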
Avinash Raj