0

I am currently using SAS version 9 to try and read in a flat file in .txt format of a HTML table that I have taken from the following page (entitled Wayne Rooney's Match History):

http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney I've got the data into a .txt file using a Python webscraper using Scrapy. The format of my .txt file is like thus:

17-08-2013,1 : 4,Swansea,Manchester United,28',7.26,Assist Assist,26-08-2013,0 : 0,Manchester United,Chelsea,90',7.03,None,14-09-2013,2 : 0,Manchester United,Crystal Palace,90',8.44,Man of the Match Goal,17-09-2013,4 : 2,Manchester United,Bayer Leverkusen,84',9.18,Goal Goal Assist,22-09-2013,4 : 1,Manchester City,Manchester United,90',7.17,Goal Yellow Card,25-09-2013,1 : 0,Manchester United,Liverpool,90',None,Man of the Match Assist,28-09-2013,1 : 2,Manchester United,West Bromwich Albion,90'...

...and so on. What I want is a dataset that has the same format as the original table. I know my way round SAS fairly well, but tend not to use infile statements all that much. I have tried a few variations on a theme, but this syntax has got me the nearest to what I want so far:

filename myfile "C:\Python27\Football Data\test.txt";

    data test;
    length date $10.
           score $6.
           home_team $40.
           away_team $40.
           mins_played $3.
           rating $4.
           incidents $40.;

    infile myfile DSD;
    input date $
          score $
          home_team $
          away_team $
          mins_played $
          rating $
          incidents $ ;
run;

This returns a dataset with only the first row of the table included. I have tried using fixed widths and pointers to set the dataset dimensions, but because the length of things like team names can change so much, this is causing the data to be reassembled from the flat file incorrectly.

I think I'm most of the way there, but can't quite crack the last bit. If anyone knows the exact syntax I need that would be great.

Thanks

gdogg371
  • 3,879
  • 14
  • 63
  • 107
  • Any particular reason you don't want to just read the file directly into SAS from the web? – Joe Jul 17 '14 at 01:34

1 Answers1

2

I would read it straight from the web. Something like this; this works about 50% but took a whole ten minutes to write, i'm sure it could be easily improved.

Basic approach is you use @'string' to read in text following a string. You might be better off reading this in as a bytestream and doing a regular expression match on <tr> ... </tr> and then parsing that rather than taking the sort of more brute force method here.

filename rooney url "http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney" lrecl=32767;
data rooney;
infile rooney scanover;
retain are_reading;
input @;
if find(_infile_,'<table id="player-fixture" class="grid fixture">')
    then are_reading=1;
if find(_infile_,'</table>') then are_reading=0;
if are_reading then do;
    input @'<td class="date">' date ddmmyy10. 
          @'class="team-link">' home_team $20.
          @'class="result-1 rc">' score $10.
          @'class="team-link">' away_team $20.
          @'title="Minutes played in this match">' mins_played $10.
          @'title="Rating in this match">' rating $6.
    ;
output;
end;
run;

As far as reading the scrapy output, you should change at least two things:

  • Add the delimiter. Not truly necessary, but I'd consider the code incorrect without it, unless delimiter is space.
  • Add a trailing "@@" to get SAS to hold the line pointer, since you don't have line feeds in your data.

    data want; infile myfile flowover dlm=',' dsd lrecl=32767; length date $10. score $6. home_team $40. away_team $40. mins_played $3. rating $4. incidents $40.; input date $ score $ home_team $ away_team $ mins_played $ rating $ incidents $ @@; run;

Flowover is default, but I like to include it to make it clear.

You also probably want to input the date as a date value (not char), so informat date ddmmyy10.;. The rating is also easily input as numeric if you want to, and both mins played and score could be input as numeric if you're doing analysis on those by adding ' and : to the delimiter list.

Finally, your . on length is incorrect; SAS is nice enough to ignore it, but . is only placed like so for formats.

Here's my final code:

data want;
infile "c:\temp\test2.txt" flowover dlm="',:"  lrecl=32767;
informat date ddmmyy10.
           score_1 score_2 2.
           home_team $40.
           away_team $40.
           mins_played 3.
           rating 4.2
           incidents $40.;
    input date 
          score_1 
          score_2
          home_team $
          away_team $
          mins_played 
          rating ??
          incidents $ @@;
run;

I remove the dsd as that's incompatible with the ' delimiter; if DSD is actually needed then you can add it back, remove that delimiter, and read minutes in as char. I add ?? for rating as it sometimes is "None" so ?? ignores the warnings about that.

Joe
  • 62,789
  • 6
  • 49
  • 67
  • 1
    Parsing HTML with regex? Sounds like choppy waters! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Allan Bowe Jul 17 '14 at 07:32
  • @joe i've read stuff in from the web directly before using SAS and it works, but it is far, far too slow for multiple submissions. python is an actual web development language and also gives the option of easily writing scalable code, which SAS does not. do you know how to read my flat file in using a SAS infile statement? – gdogg371 Jul 17 '14 at 10:13
  • @RawFocus Hardly applicable here. – Joe Jul 17 '14 at 14:15
  • @user3045351 SAS is every bit as scalable as Python, if you know what you're doing, and it shouldn't be much slower than python unless you're doing a very large number of pulls (millions?). As for reading your flat file, i'll update the answer. – Joe Jul 17 '14 at 14:25