I am trying to get the Premier League table data from https://www.premierleague.com/tables. I can get the data with the code below, but unfortunately it only works for the latest season option (2018/2019). The page offers tables for other seasons as well (2017/2018, ...). How can I scrape the other tables?

from lxml import html
import requests

page = requests.get('https://www.premierleague.com/tables')

tree = html.fromstring( page.content )

team_rows = tree.xpath('//table//tbody//tr[@data-filtered-table-row-name]')[0:20]
team_names = [i.attrib['data-filtered-table-row-name'] for i in team_rows] 

teams = {}

for i in range(20):
    element = team_rows[i]
    teams[team_names[i]] = element.getchildren()

for i in team_names:
    values = [j.text_content() for j in teams[i]]
    row = "{} "*9
    print( row.format(i, *values[3:12] ) )
  • Open that page in Chrome and then open the Network tab. Now when you change the season, you can see what request Chrome makes to get that data. – pguardiario Jan 02 '19 at 00:59
  • The requests go to `https://footballapi.pulselive.com/football/standings`, but I don't understand why that URL should be used, nor the `params` and `headers` arguments passed to `requests.get`. –  Jan 02 '19 at 12:10

1 Answer

but unfortunately it only works for the latest season option (2018/2019)

The website uses JavaScript to load the older tables (1992-2017), so a plain Python request only ever gets the latest table. If you want to scrape the table filtered by year/season, here is a hard-coded version (I could not find a rule behind the season ID numbers). If you want a more elegant solution, selenium or requests_html might suit you.

Note: I am imitating the request that the JavaScript sends to the server, so the response content is JSON. This example can only fetch the Premier League table for different years; filtering by competition, matchweek, or home/away is not available. To add those options to the script, you need to work out the rules behind the URL parameters (using the approach @pguardiario described, or a tool such as Fiddler).

import requests
from pprint import pprint

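# Map each season's starting year to the ID the API expects in compSeasons.
# Starting years 1992-2013 map to IDs 1-22.
years = {str(1991+i):str(i) for i in range(1,23)}
# Later seasons follow no obvious rule, so their IDs are hard-coded.
years.update({
    "2018":"210",
    "2017":"79",
    "2016":"54",
    "2015":"42",
    "2014":"27"
    })

# Season ID for 2017/2018
specific = years.get("2017")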

param = {
    "altIds":"true",
    "compSeasons":specific,
    "detail":2,
    "FOOTBALL_COMPETITION":1
}

headers = {
    "Origin": "https://www.premierleague.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "Referer": "https://www.premierleague.com/tables?co=1&se={}&ha=-1".format(specific),
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
    }

page = requests.get('https://footballapi.pulselive.com/football/standings',
                 params=param,
                 headers=headers
                 )
print(page.url)
pprint(page.json())
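
The JSON printed above is nested, so a small helper can flatten it into one row per team. This is only a minimal sketch: the key names (`tables`, `entries`, `team`, `overall`, and so on) are assumptions about the response layout, the helper name `standings_rows` is hypothetical, and the sample payload is made up for illustration. Compare it against the output of `pprint(page.json())` and adjust the keys if the real layout differs.

```python
from typing import Any, Dict, List


def standings_rows(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a standings payload into one dict per team.

    ASSUMPTION: the payload is shaped like
    {"tables": [{"entries": [{"position": ..., "team": {"name": ...},
                              "overall": {"played": ..., "points": ...}}]}]}
    Missing keys yield None instead of raising, so a layout mismatch is
    visible in the output rather than crashing the script.
    """
    rows = []
    for table in payload.get("tables", []):
        for entry in table.get("entries", []):
            overall = entry.get("overall", {})
            rows.append({
                "position": entry.get("position"),
                "team": entry.get("team", {}).get("name"),
                "played": overall.get("played"),
                "points": overall.get("points"),
            })
    return rows


# Hypothetical sample standing in for page.json(); the values are made up.
sample = {
    "tables": [{
        "entries": [
            {"position": 1, "team": {"name": "Team A"},
             "overall": {"played": 38, "points": 90}},
            {"position": 2, "team": {"name": "Team B"},
             "overall": {"played": 38, "points": 80}},
        ]
    }]
}

for row in standings_rows(sample):
    print(row["position"], row["team"], row["played"], row["points"])
```

With a real response you would call `standings_rows(page.json())` instead of passing `sample`.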

How to get different tables from one page

I feel your question title differs from your description. If so, the other issue is that you combine all the tables into one. You should also be careful with `//` in XPath; see "What is meaning of .// in XPath?".

Note: If you want old Premier League table data, use the code in the 1st part, because that data can only be obtained that way.

from lxml import html
import requests
from pprint import pprint

years = {str(1991+i):str(i) for i in range(1,23)}
years.update({
    "2018":"210",
    "2017":"79",
    "2016":"54",
    "2015":"42",
    "2014":"27"
    })

param = {
    "co":"1",
    "se":years.get("2017"),
    "ha":"-1"
}


page = requests.get('https://www.premierleague.com/tables', params=param)

tree = html.fromstring( page.content )
tables = tree.xpath('//tbody[contains(@class,"tableBodyContainer")]')
each_table_team_rows = [table.xpath('tr[@data-filtered-table-row-name]') for table in tables]
team_names = [[i.attrib['data-filtered-table-row-name'] for i in team_rows] for team_rows in each_table_team_rows]

pprint(team_names)
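
To illustrate the `//` versus `.//` point, here is a small sketch on a made-up HTML fragment (the table ids and team names are hypothetical, not taken from the real page). Called on an element, an expression starting with `//` still searches the whole document, while `.//` only searches below that element, which is what keeps the tables separate.

```python
from lxml import html

# Made-up fragment with two tables, each holding one team row.
doc = html.fromstring("""
<div>
  <table id="first"><tbody>
    <tr data-filtered-table-row-name="Team A"></tr>
  </tbody></table>
  <table id="second"><tbody>
    <tr data-filtered-table-row-name="Team B"></tr>
  </tbody></table>
</div>
""")

first_table = doc.xpath('//table')[0]

# '//' starts from the document root, even when called on first_table.
absolute = first_table.xpath('//tr/@data-filtered-table-row-name')

# './/' starts from first_table, so only its own rows are returned.
relative = first_table.xpath('.//tr/@data-filtered-table-row-name')

print(absolute)  # ['Team A', 'Team B']
print(relative)  # ['Team A']
```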
KC.
  • Using `requests.get('https://footballapi.pulselive.com/football/standings', params=param, headers=headers)` works, but using param = `{'co': '1', 'se': '21', 'ha': '-1'}` still returns the default season's table when `se=21`. The 2nd part of the answer does not work, even though I followed http://docs.python-requests.org/en/master/user/quickstart/#make-a-request –  Jan 02 '19 at 12:06
  • When I modified the first part I forgot to edit the second part. *Apart from the latest season, you need to use my 1st example to get the data.* Actually, my 2nd part answers your title question, because the title reads like a different question. – KC. Jan 02 '19 at 12:37
  • The code in my 2nd part doesn't work for other seasons because the actual way to get the match data is through the URL in my 1st part. – KC. Jan 02 '19 at 12:46
  • The 2nd part only returns the default table, regardless of the value of `'se'` (which means it does not solve the problem). If it only gets the default table, then the other years are not needed. –  Jan 02 '19 at 13:00
  • Yes, my 2nd part just shows how to extract data with `XPath`. *If you want to get the data, you have to use my 1st part,* because that is how the website works. Of course, *if you insist on using XPath you have to use `selenium`*, since I tried requests_html and got unexpected results. – KC. Jan 02 '19 at 13:12
  • I see `'footballapi.pulselive.com/football/competitions?page=0&pageSize=100&detail=2'` in Chrome instead of `'https://footballapi.pulselive.com/football/standings'`, although both worked. –  Jan 02 '19 at 13:30
  • That's great news, but the table data you want is in `/football/standings`. – KC. Jan 02 '19 at 13:37