0

I am having difficulty creating two columns, "Home Score" and "Away Score", in the wikipedia table I am trying to parse.

I tried the following script with two try-except-else statements to see if that would work.

test_matches = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')

test_matches = test_matches[1]

test_matches['Year'] = test_matches['Date'].str[-4:].apply(pd.to_numeric)
test_matches_worst = test_matches[(test_matches['Winner'] != 'Wales') & (test_matches['Year'] >= 2007) & (test_matches['Competition'].str.contains('Nations'))]

try:
  test_matches_worst['Home Score'] = test_matches_worst['Score'].str.split("–").str[0].apply(pd.to_numeric)
except:
  print("let's try again")
else:
  test_matches_worst['Home Score'] = test_matches_worst['Score'].str.split("-").str[0].apply(pd.to_numeric)

try:
  test_matches_worst['Away Score'] = test_matches_worst['Score'].str.split("–").str[1].apply(pd.to_numeric)
except:
  print("let's try again")
else: 
  test_matches_worst['Away Score'] = test_matches_worst['Score'].str.split("-").str[1].apply(pd.to_numeric)

test_matches_worst['Margin'] = (test_matches_worst['Home Score'] - test_matches_worst['Away Score']).abs()

test_matches_worst.sort_values('Margin', ascending=False).reset_index(drop = True)#.head(20)

However, I would receive a Key error message and the "Home Score" is not displayed in the dataframe when shortening the code. What is the best way to handle this particular table and to generate the columns that I want? Any assistance on this would be greatly appreciated. Thanks in advance.

BLuta
  • 243
  • 1
  • 10

1 Answers1

0

The problem of the data you collect is the hyphen or dash. Except the last row, all score separator are the 'En Dash' (U+2013) and not the 'Hyphen' (U+002D):

sep = r'[-\u2013]'

# df is test_matches_worst
df[['Home Score','Away Score']] = df['Score'].str.split(sep, expand=True).astype(int)
df['Margin'] = df['Home Score'].sub(df['Away Score']).abs

Output:

>>> df[['Score', 'Home Score', 'Away Score', 'Margin']]
     Score  Home Score  Away Score  Margin
565   9–19           9          19      10
566   21–9          21           9      12
567  32–21          32          21      11
568  23–20          23          20       3
593  21–16          21          16       5
595  15–17          15          17       2
602  30–17          30          17      13
604  20–26          20          26       6
605  27–12          27          12      15
614  19–26          19          26       7
618   28–9          28           9      19
644  22–30          22          30       8
656   26–3          26           3      23
658  29–18          29          18      11
666  16–21          16          21       5
679  16–16          16          16       0
682  25–21          25          21       4
693  16–21          16          21       5
694  29–13          29          13      16
696  20–18          20          18       2
704   12–6          12           6       6
705  37–27          37          27      10
732  24–14          24          14      10
733  23–27          23          27       4
734  33–30          33          30       3
736  10–14          10          14       4
737   32–9          32           9      23
739  13–24          13          24      11
745  32–30          32          30       2
753   29-7          29           7      22

Note: you will probably receive a SettingWithCopyWarning

To solve it, use test_matches = test_matches[1].copy()

Bonus

Pandas function like to_datetime, to_timedelta or to_numeric can take a Series as parameter so you can avoid apply:

test_matches['Year'] = pd.to_numeric(test_matches['Date'].str[-4:])
Corralien
  • 109,409
  • 8
  • 28
  • 52