
I would like to scrape a table from the Ligue 1 football website, specifically the table that contains information on cards and referees.

http://www.ligue1.com/LFPStats/stats_arbitre?competition=D1

I am using the following code:

import requests
from bs4 import BeautifulSoup
import csv

r=requests.get("http://www.ligue1.com/LFPStats/stats_arbitre?competition=D1")

soup= BeautifulSoup(r.content, "html.parser")
table=soup.find_all('table')

This returns a different table from elsewhere in the HTML. I have tried to work around this by indexing the result of find_all with [0], [1], etc., but that returns nothing. I have also searched for tr and td, with similar results. I have no idea why Beautiful Soup ignores this table.

The table I am looking for is in the HTML code below

<table>
<thead>
  <tr>
    <th class="{sorter: false} hide position">Position</th>
    <th class="{sorter: false} joueur">Referees</th>
    <th class="chiffre header"><span class="icon icon_carton_jaune">Yellow card</span></th>
    <th class="chiffre header"><span class="icon icon_carton_rouge">Red card</span></th>
    <th class="chiffre header">Matches</th>
  </tr>
</thead>
    <tbody><tr>
  <td class="position"></td>
  <td class="joueur">Benoît BASTIEN</td>
  <td class="chiffre"><a href="/stats_arbitre_details/245">25</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/245">4</a></td>
  <td class="chiffre">8</td>
</tr>
    <tr class="odd">
  <td class="position"></td>
  <td class="joueur">Hakim BEN EL HADJ</td>
  <td class="chiffre"><a href="/stats_arbitre_details/259">55</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/259">4</a></td>
  <td class="chiffre">10</td>
</tr>
    <tr>
  <td class="position"></td>
  <td class="joueur">Wilfried BIEN</td>
  <td class="chiffre"><a href="/stats_arbitre_details/162">44</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/162">3</a></td>
  <td class="chiffre">9</td>
</tr>
    <tr class="odd">
  <td class="position"></td>
  <td class="joueur">Ruddy BUQUET</td>
  <td class="chiffre"><a href="/stats_arbitre_details/269">33</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/269">2</a></td>
  <td class="chiffre">7</td>
</tr>
    <tr>
  <td class="position"></td>
  <td class="joueur">Tony CHAPRON</td>
  <td class="chiffre"><a href="/stats_arbitre_details/102">43</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/102">1</a></td>
  <td class="chiffre">8</td>
</tr>
    <tr class="odd">
  <td class="position"></td>
  <td class="joueur">Amaury DELERUE</td>
  <td class="chiffre"><a href="/stats_arbitre_details/343">30</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/343">0</a></td>
  <td class="chiffre">6</td>
</tr>
    <tr>
  <td class="position"></td>
  <td class="joueur">Saïd ENNJIMI</td>
  <td class="chiffre"><a href="/stats_arbitre_details/113">27</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/113">1</a></td>
  <td class="chiffre">6</td>
</tr>
    <tr class="odd">
  <td class="position"></td>
  <td class="joueur">Fredy FAUTREL</td>
  <td class="chiffre"><a href="/stats_arbitre_details/338">25</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/338">2</a></td>
  <td class="chiffre">8</td>
</tr>
    <tr>
  <td class="position"></td>
  <td class="joueur">Antony GAUTIER</td>
  <td class="chiffre"><a href="/stats_arbitre_details/331">31</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/331">8</a></td>
  <td class="chiffre">9</td>
</tr>
    <tr class="odd">
  <td class="position"></td>
  <td class="joueur">Johan HAMEL</td>
  <td class="chiffre"><a href="/stats_arbitre_details/334">43</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/334">7</a></td>
  <td class="chiffre">9</td>
</tr>
    <tr>
  <td class="position"></td>
  <td class="joueur">Lionel JAFFREDO</td>
  <td class="chiffre"><a href="/stats_arbitre_details/124">40</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/124">2</a></td>
  <td class="chiffre">9</td>
</tr>
    <tr class="odd">
  <td class="position"></td>
  <td class="joueur">Stéphane JOCHEM</td>
  <td class="chiffre"><a href="/stats_arbitre_details/294">33</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/294">4</a></td>
  <td class="chiffre">8</td>
</tr>
    <tr>
  <td class="position"></td>
  <td class="joueur">Stéphane LANNOY</td>
  <td class="chiffre"><a href="/stats_arbitre_details/127">24</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/127">0</a></td>
  <td class="chiffre">6</td>
</tr>
    <tr class="odd">
  <td class="position"></td>
  <td class="joueur">Mikael LESAGE</td>
  <td class="chiffre"><a href="/stats_arbitre_details/286">38</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/286">3</a></td>
  <td class="chiffre">9</td>
</tr>
    <tr>
  <td class="position"></td>
  <td class="joueur">Jérôme MIGUELGORRY</td>
  <td class="chiffre"><a href="/stats_arbitre_details/239">32</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/239">1</a></td>
  <td class="chiffre">10</td>
</tr>
    <tr class="odd">
  <td class="position"></td>
  <td class="joueur">Benoît MILLOT</td>
  <td class="chiffre"><a href="/stats_arbitre_details/287">43</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/287">0</a></td>
  <td class="chiffre">11</td>
</tr>
    <tr>
  <td class="position"></td>
  <td class="joueur">Sébastien MOREIRA</td>
  <td class="chiffre"><a href="/stats_arbitre_details/148">38</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/148">5</a></td>
  <td class="chiffre">10</td>
</tr>
    <tr class="odd">
  <td class="position"></td>
  <td class="joueur">Nicolas RAINVILLE</td>
  <td class="chiffre"><a href="/stats_arbitre_details/188">40</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/188">7</a></td>
  <td class="chiffre">10</td>
</tr>
    <tr>
  <td class="position"></td>
  <td class="joueur">Frank SCHNEIDER</td>
  <td class="chiffre"><a href="/stats_arbitre_details/247">33</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/247">4</a></td>
  <td class="chiffre">10</td>
</tr>
    <tr class="odd">
  <td class="position"></td>
  <td class="joueur">Clément TURPIN</td>
  <td class="chiffre"><a href="/stats_arbitre_details/333">26</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/333">3</a></td>
  <td class="chiffre">8</td>
</tr>
    <tr>
  <td class="position"></td>
  <td class="joueur">Bartolomeu VARELA</td>
  <td class="chiffre"><a href="/stats_arbitre_details/288">35</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/288">3</a></td>
  <td class="chiffre">9</td>
</tr>
</tbody></table>

I have also tried searching for td elements with a specific class, which should work, but it can't find the table in the first place.

2 Answers

2

The problem is that (I assume) you are looking at the HTML shown by the browser's inspector; what you are missing is that the table is added to the page by JavaScript after it loads.

You can confirm this in Chrome (or any other browser): instead of "Inspect", choose "View Page Source", and you will see that there is no such table in the server response.
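
The same check can be scripted. This is only a minimal sketch and not part of the original answer: the helper name `has_referee_cells` and the two sample strings are made up for illustration, and the `class="joueur"` marker is taken from the table HTML in the question. It is a crude substring test, so it only tells you whether that markup is present in a raw response.

```python
def has_referee_cells(html: str) -> bool:
    """Return True if the raw HTML contains a referee-name cell."""
    return 'class="joueur"' in html


# Hypothetical stand-ins: a page shell without the table, and a fragment
# shaped like the table in the question.
page_source = "<html><body><div id='stats'></div><script src='app.js'></script></body></html>"
xhr_fragment = '<table><tbody><tr><td class="joueur">Tony CHAPRON</td></tr></tbody></table>'

print(has_referee_cells(page_source))
print(has_referee_cells(xhr_fragment))
```

With the code from the question you would pass `r.text` to the helper; if the marker is missing from the raw response but visible in "Inspect", the table is being added later by JavaScript.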

The page actually loads the table from "http://www.ligue1.com/stats_arbitre?competition=D1", but there is a catch: you must indicate via an HTTP header that the request is an XHR (XMLHttpRequest). If you open this URL directly in the browser, you'll get a 500 response.

Try this curl example to check that it returns the table you want.

curl --header "X-Requested-With: XMLHttpRequest" "http://www.ligue1.com/stats_arbitre?competition=D1"

In your code, do this:

import requests
from bs4 import BeautifulSoup
import csv

headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('http://www.ligue1.com/stats_arbitre?competition=D1', headers=headers)

...
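
What follows the request could look like the sketch below. It assumes the XHR response contains the same table markup as shown in the question; the function names `parse_referee_rows` and `rows_to_csv`, the CSV column names, and the two-row `SAMPLE_HTML` (copied from the question and standing in for `r.text`) are illustrative choices, not something the site or the original code defines.

```python
import csv
import io

from bs4 import BeautifulSoup


def parse_referee_rows(html):
    """Extract (referee, yellow cards, red cards, matches) from the table HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 5:  # the header row holds <th> cells, so it is skipped
            continue
        _position, name, yellow, red, matches = cells
        rows.append((name, int(yellow), int(red), int(matches)))
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["referee", "yellow_cards", "red_cards", "matches"])
    writer.writerows(rows)
    return buffer.getvalue()


# Two rows copied from the question, standing in for r.text
SAMPLE_HTML = """
<table>
<thead>
  <tr>
    <th class="{sorter: false} hide position">Position</th>
    <th class="{sorter: false} joueur">Referees</th>
    <th class="chiffre header">Yellow card</th>
    <th class="chiffre header">Red card</th>
    <th class="chiffre header">Matches</th>
  </tr>
</thead>
<tbody><tr>
  <td class="position"></td>
  <td class="joueur">Benoît BASTIEN</td>
  <td class="chiffre"><a href="/stats_arbitre_details/245">25</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/245">4</a></td>
  <td class="chiffre">8</td>
</tr>
<tr class="odd">
  <td class="position"></td>
  <td class="joueur">Hakim BEN EL HADJ</td>
  <td class="chiffre"><a href="/stats_arbitre_details/259">55</a></td>
  <td class="chiffre"><a href="/stats_arbitre_details/259">4</a></td>
  <td class="chiffre">10</td>
</tr>
</tbody></table>
"""

rows = parse_referee_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

In the real script, `parse_referee_rows(r.text)` would replace the sample, and the CSV text could be written to a file opened with `newline=""` and an explicit `encoding="utf-8"` so the accented names survive.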

Hope it helps

  • Hello, thank you that was extremely helpful. I entered the curl example into my terminal and it pulled out the HTML script I was trying to get when I was using the inspect option on the browser. After your suggested modifications to my code though, I still can't pull out the HTML I am after. Should I try and perform a curl using Python or is there another solution? Apologies for my naivety I am quite new to this. – Richard Hudson Dec 16 '15 at 11:15
  • I also only get response 200 – Richard Hudson Dec 16 '15 at 13:14
  • That seems pretty weird; I just tested again and it worked. Did you change the URL from `http://www.ligue1.com/LFPStats/stats_arbitre?competition=D1` to `http://www.ligue1.com/stats_arbitre?competition=D1`? – Nodiel Clavijo Llera Dec 17 '15 at 15:00
  • That works perfectly, Thanks I hadn't spotted that. Is there any chance you could explain why removing /LFPStats from the url has this effect? – Richard Hudson Dec 18 '15 at 13:02
  • That's because the webpage is located at `http://www.ligue1.com/LFPStats/stats_arbitre?competition=D1`, and that page makes an AJAX request to `http://www.ligue1.com/stats_arbitre?competition=D1`, which returns the table that is then inserted through JavaScript. This happens very often when scraping the web, so do not trust "Inspect element"; always check "View page source" and the "Network" tab in Chrome dev tools, where you can watch all the requests. – Nodiel Clavijo Llera Dec 21 '15 at 20:54
0

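Selenium can do it, because it drives a real browser that runs the page's JavaScript.

from selenium import webdriver
import time

# Page that builds the table with JavaScript (URL from the question)
url = "http://www.ligue1.com/LFPStats/stats_arbitre?competition=D1"

driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)  # crude fixed wait for the JavaScript to insert the table
htmlSource = driver.page_source
driver.quit()  # close the browser once the HTML has been captured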
xingpei Pang