Python parsing data from a website using regular expression

Question

I'm trying to parse some data from this website: http://www.csfbl.com/freeagents.asp?leagueid=2237

I've written some code:

import urllib
import re

name = re.compile('<td><a href="[^"]+" onclick="[^"]+">(.+?)</a>')
player_id = re.compile('<td><a href="(.+?)" onclick=')
#player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')

url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'

sock = urllib.request.urlopen(url).read().decode("utf-8")

#li = name.findall(sock)
name = name.findall(sock)
player_id = player_id.findall(sock)
#player_id_num = player_id_num.findall(sock)
#age = age.findall(sock)
stat_c = stat_c.findall(sock)
stat_p = stat_p.findall(sock)

First question : player_id returns the whole url "player.asp?playerid=4209661". I was unable to get just the number part. How can I do that? (my attempt is described in #player_id_num)

Second question: I am not able to get stat_c when span_class is empty as in "".

Is there a way I can get these resolved? I am not very familiar with RE (regular expressions), I looked up tutorials online but it's still unclear what I am doing wrong.

[Have you considered using an HTML parser?](http://stackoverflow.com/a/1732454/3001761) — jonrsharpe, Jun 03 '15 at 00:15
You started with one problem and decided to use regular expressions. Now you have two problems. — Jim Garrison, Jun 03 '15 at 02:02

WGS · Answer 1 · 2015-06-03T04:23:13.570

Very simple using the pandas library.

Code:

import pandas as pd

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)

# print dfs[3]
# dfs[3].to_csv("stats.csv") # Send to a CSV file.
print dfs[3].head()

Result:

    0                 1    2  3      4     5     6     7     8      9     10  \
0  Pos              Name  Age  T     PO    FI    CO    SY    HR     RA    GL   
1    P    George Pacheco   38  R   4858  7484  8090  7888  6777   4353  6979   
2    P     David Montoya   34  R   3944  5976  6673  8699  6267   6685  5459   
3    P       Robert Cole   34  R   5769  7189  7285  5863  6267   5868  5462   
4    P  Juanold McDonald   32  R  69100  5772  4953  4866  5976  67100  5362   

     11    12  13       14      15          16  
0    AR    EN  RL  Fatigue  Salary         NaN  
1  3747  6171  -3     100%     ---  $3,672,000  
2  5257  5975  -4      96%      2%  $2,736,000  
3  4953  5061  -4      96%      3%  $2,401,000  
4  5982  5263  -4     100%     ---  $1,890,000

You can apply whatever cleaning methods you want from here onwards. Code is rudimentary so it's up to you to improve it.

More Code:

import pandas as pd
import itertools

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
df = dfs[3] # "First" stats table.

# The first row is the actual header.
# Also, notice the NaN at the end.
header = df.iloc[0][:-1].tolist()
# Fix that atrocity of a last column.
df.drop([15], axis=1, inplace=True)

# Last row is all NaNs. This particular
# table should end with Jeremy Dix.
df = df.iloc[1:-1,:]
df.columns = header
df.reset_index(drop=True, inplace=True)

# Pandas cannot create two rows without the
# dataframe turning into a nightmare. Let's
# try an aesthetic change.
sub_header = header[4:13]
orig = ["{}{}".format(h, "r") for h in sub_header]
clone = ["{}{}".format(h, "p") for h in sub_header]

# http://stackoverflow.com/a/3678930/2548721
comb = [iter(orig), iter(clone)]
comb = list(it.next() for it in itertools.cycle(comb))

# Construct the new header.
new_header = header[0:4]
new_header += comb
new_header += header[13:]

# Slow but does it cleanly.
for s, o, c in zip(sub_header, orig, clone):
    df.loc[:, o] = df[s].apply(lambda x: x[:2])
    df.loc[:, c] = df[s].apply(lambda x: x[2:])

df = df[new_header] # Drop the other columns.

print df.head()

More result:

  Pos              Name Age  T POr  POp FIr FIp COr COp     ...      RAp GLr  \
0   P    George Pacheco  38  R  48   58  74  84  80  90     ...       53  69   
1   P     David Montoya  34  R  39   44  59  76  66  73     ...       85  54   
2   P       Robert Cole  34  R  57   69  71  89  72  85     ...       68  54   
3   P  Juanold McDonald  32  R  69  100  57  72  49  53     ...      100  53   
4   P      Trevor White  37  R  61   66  62  64  67  67     ...       38  48   

  GLp ARr  ARp ENr ENp  RL Fatigue      Salary  
0  79  37   47  61  71  -3    100%  $3,672,000  
1  59  52   57  59  75  -4     96%  $2,736,000  
2  62  49   53  50  61  -4     96%  $2,401,000  
3  62  59   82  52  63  -4    100%  $1,890,000  
4  50  70  100  62  69  -4    100%  $1,887,000

Obviously, what I did instead was separate the Real values from Potential values. Some tricks were used but it gets the job done at least for the first table of players. The next few ones require a degree of manipulation.

this is excellent!!!! However, the PO FI CO SY numbers are actually 2 rows. How can I make panda spit out two rows? — Euisoo Yoo, Jun 03 '15 at 01:01
You can't, at least not without some index manipulation. But you can parse the columns to split the numbers, something like: `48|58` or somesuch. — WGS, Jun 03 '15 at 01:09
This has me interested. Let me see what I can do about it further. — WGS, Jun 03 '15 at 01:13

Python parsing data from a website using regular expression

1 Answers1