1
  1. Go to this url https://www.horseracebase.com/horse-racing-results.php?year=2005&month=3&day=15 (username = TrickyBen | password = TrickyBen123)
  2. Notice that there is a Download Excel button (in Red)
  3. I want to download the excel file and turn it into a pandas dataframe. I want to do it programatically (ie. from the script, not by manually clicking around the website). How would I do this?

This code will get you logged in as TrickyBen, and make a request to the website API...

import requests from lxml import html from requests import Session import pandas as pd import shutil

raceSession = Session()

LoginDetails = {'login': 'TrickyBen', 'password': 'TrickyBen123'}

LoginUrl = 'https://www.horseracebase.com/horse-racing-results.php?year=2005&month=3&day=15/horsebase1.php'
LoginPost = raceSession.post(LoginUrl, data=LoginDetails)

RaceUrl = 'https://www.horseracebase.com/excelresults.php'
RaceDataDetails =  {"user": "41495", "racedate": "2005-3-15", "downloadbutton": "Excel"}

PostHeaders = {"Content-Type": "application/x-www-form-urlencoded"}
Response = raceSession.post(RaceUrl, data=RaceDataDetails, headers=PostHeaders)

Table = pd.read_table(Response.text)

Table.to_csv('blahblah.csv')

If you inspect element, you'll notice that the relevant element looks like this...

<form action="excelresults.php" method="post">
    <input type="hidden" name="user" value="41495">
    <input type="hidden" name="racedate" value="2005-3-15">
    <input type="submit" class="downloadbutton" value="Excel">
</form>

I get this error message...

Traceback (most recent call last):
  File "/Users/Alex/Desktop/DateTest/hrpull.py", line 20, in <module>
    Table = pd.read_table(Response.text)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 799, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
  File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 1213, in __init__
self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 358, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3427)
  File "pandas/parser.pyx", line 628, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6861)
IOError: File race_date race_time   track   race_name       race_restrictions_age   race_class  major   race_distance   prize_money     going_description   number_of_runners   place   distbt  horse_name  stall       trainer horse_age   jockey_name jockeys_claim   pounds  odds    fav     official_rating comptime    TotalDstBt  MedianOR    Dist_Furlongs       placing_numerical   RCode   BFSP    BFSP_Place  PlcsPaid    BFPlcsPaid      Yards   RailMove    RaceType    
"2005-03-15"    "14:00:00"  "Cheltenham"    "Letheby & Christopher Supreme Novices Hurdle " "4yo+"  "Class 1"   "Grade 1"   "2m˝f " "58000" "Good"  "20"    "1st"       "Arcalis"   "0" "Johnson, J Howard" "5" "Lee, G"    "0" "161"   "21"        "136"   "3 mins 53.00s"     "121.5" "16.5"  "1" "National Hunt" "0" "0" "3" "0" "0" "0" "Novices Hurdle"
"2005-03-15"    "14:00:00"  "Cheltenham"    "Letheby & Christopher Supreme Novices Hurdle " "4yo+"  "Class 1"   "Grade 1"   "2m˝f " "58000" "Good"  "20"    "2nd"   "6" "Wild Passion (GER)"    "0" "Meade, Noel"   "5" "Carberry, P"   "0" "161"   "11"        "0" "3 mins 53.00s" "6" "121.5" "16.5"  "2" "National Hunt" "0" "0" "3" "0" "0" "0" "Novices Hurdle"
BennyBoe
  • 11
  • 1
  • 3
  • So you want to download the info without downloading the file? – TidyDev Oct 21 '17 at 22:24
  • Hey - I want to download the file programatically. Ie. not simply by manually downloading the file to my desktop and then using read_csv to read the file into DataFrame. Of course, read_csv will get the file into the DataFrame, but I want to access the file from the script. I hope this makes sense? – BennyBoe Oct 21 '17 at 22:49

2 Answers2

0

I'm thinking that you can see the data that you want to download in another web page, for example, by clicking on "My Systems (v4)". If you can do that, then you can use urllib.request.urlretrieve to download that page. And then you can use html.parser.HTMLParser to parse the data and do with as you wish.

John Anderson
  • 35,991
  • 4
  • 13
  • 36
0

If you would look at the api being called on the form action, you'll see that you've to make a post request to this url :

https://www.horseracebase.com/excelresults.php

with following parameters:

data = {
    "user": "41495", # looks like this varies with login, so update in case you change your login id
    "racedate": "2005-3-15",
    "downloadbutton": "Excel"
}

you can do something like this:

response = raceSession.post(reqUrl, json=data)

If this doesn't work, try adding headers to the request like: headers=postHeaders. For ex. you should set the content type header in this case, as you're sending form encoded data, so:

headers = {"Content-Type": "application/x-www-form-urlencoded"} 

Read this for more info on how to save the excel to a file

Here's the response for this request in Postman, so looks like you won't need any additional headers except the content-type :

enter image description here

EDIT

This is what you need to do:

raceSession = Session()

RaceUrl = 'https://www.horseracebase.com/excelresults.php'
RaceDataDetails =  {"user": "41495", "racedate": "2005-3-15", "downloadbutton": "Excel"}

PostHeaders = {"Content-Type": "application/x-www-form-urlencoded"}
Response = raceSession.post(RaceUrl, data=RaceDataDetails, headers=PostHeaders)
# from StringIO import StringIO #for python 2.x
#import StringIO #for python 3.x
Table = pd.read_table(StringIO(Response.text)) 
Ashish Ranjan
  • 5,523
  • 2
  • 18
  • 39
  • Hey guys, thanks for this. I've made some progress but my code doesn't work. Are you able to please see what I'm doing wrong? – BennyBoe Oct 22 '17 at 08:37
  • raceSession = Session() LoginDetails = {'login': 'TrickyBen', 'password': 'TrickyBen123'} LoginUrl = 'https://www.horseracebase.com/horse-racing-results.php?year=2005&month=3&day=15/horsebase1.php' LoginPost = raceSession.post(LoginUrl, data=LoginDetails) RaceUrl = 'https://www.horseracebase.com/excelresults.php' RaceDataDetails = {"user": "41495", "racedate": "2005-3-15", "downloadbutton": "Excel"} PostHeaders = {"Content-Type": "application/x-www-form-urlencoded"} Response = raceSession.post(RaceUrl, json=RaceDataDetails, headers=PostHeaders) Table = pd.read_table(Response) – BennyBoe Oct 22 '17 at 08:39
  • @BennyBoe can you please add this to your question – Ashish Ranjan Oct 22 '17 at 08:57
  • you're passing Response directly to pandas, you should first save it to a file and then try to read it. Check the answer for more info on saving to file. – Ashish Ranjan Oct 22 '17 at 09:01
  • Thanks I've updated the code in the answer above. The error message I get now is as follows...

    You must hold a valid HorseRaceBase membership to access this feature. Sign up for a free trial now

    I'm not sure why I get this message, because I had already logged in on the same session.
    – BennyBoe Oct 22 '17 at 09:29
  • change `json=` to `data=`. working for me in python as well, tested right now. also use `pd.read_table(Response.text)` – Ashish Ranjan Oct 22 '17 at 10:39
  • super! the file comes back, but still having problems though. are you able to please paste the script? i've pasted the script i use above - alongside the error message that i get. – BennyBoe Oct 22 '17 at 10:56
  • added @BennyBoe – Ashish Ranjan Oct 22 '17 at 11:19
  • thanks so much, for you help and patience :-) . i've voted up, but it says something about me needing a better reputation (ie. reputation > 15). vote is recorded but not displayed. – BennyBoe Oct 22 '17 at 11:48
  • you can mark the answer as correct, helps future visitors with similar problems – Ashish Ranjan Oct 22 '17 at 11:52