I have written the following code to scrape data from a website (e.g. https://www.oddsportal.com/soccer/new-zealand/football-championship/hamilton-canterbury-GhUEDiE0/). The data in question are the over/under values, which can be found in the page's HTML code:
<tr class="lo odd">
<td>
<div class="l"><a class="name2" title="Go to Pinnacle website!" onclick="return !window.open(this.href)" href="/bookmaker/pinnacle/link/"><span class="blogos l18"></span></a> <a class="name" title="Go to Pinnacle website!" onclick="return !window.open(this.href)"
href="/bookmaker/pinnacle/link/">Pinnacle</a> </div><span class="ico-bookmarker-info ico-bookmaker-detail"><a title="Show more details about Pinnacle" href="/bookmaker/pinnacle/"></a></span></td>
<td class="center">+0.5</td>
<td class="right odds">
<div class=" deactivateOdd" onmouseout="delayHideTip()" onmouseover="page.hist(this,'P-0.50-0-0','4j5hgx1tkucx1ix0',18,event,0,1)">1.10</div>
</td>
<td class="right odds up-dark">
<div class=" deactivateOdd" onmouseout="delayHideTip()" onmouseover="page.hist(this,'P-0.50-0-0','4j5hgx1tl1gx1ix0',18,event,0,1)">7.85</div>
</td>
<td class="center info-value"><span>-</span></td>
<td onmouseout="delayHideTip()" class="check ch1" xparam="The match has already started~2"></td>
</tr>
The interesting part is the over/under values, here for example 1.10 and 7.85. This data is scraped and arranged in a data frame:
master_df= pd.DataFrame()
for match in self.all_links:
#for match in links:
    self.openmatch(match)
    self.clickou()
    self.expandodds()
    for x in range(1,28):
        L = []
        bookmakers=['Asianodds','Pinnacle']
        #odds_type=fi2('//*[@id="odds-data-table"]/div{}/div/strong/a'.format(x))
        if x==1:
            over_under_type= 'Over/Under +0.5'
        elif x==4:
            over_under_type= 'Over/Under +1'
        elif x==6:
            over_under_type= 'Over/Under +1.5'
        elif x==8:
            over_under_type= 'Over/Under +1.75'
        elif x==9:
            over_under_type= 'Over/Under +2'
        elif x==10:
            over_under_type= 'Over/Under +2.25'
        elif x==11:
            over_under_type= 'Over/Under +2.5'
        elif x==13:
            over_under_type= 'Over/Under +2.75'
        elif x==14:
            over_under_type= 'Over/Under +3'
        elif x==16:
            over_under_type= 'Over/Under +3.5'
        elif x==19:
            over_under_type= 'Over/Under +4'
        elif x==21:
            over_under_type= 'Over/Under +4.5'
        elif x==26:
            over_under_type= 'Over/Under +5.5'
        elif x==28:
            over_under_type= 'Over/Under +6.5'
        for j in range(1,15): # only first 10 bookmakers displayed
            Book = self.ffi('//*[@id="odds-data-table"]/div[{}]/table/tbody/tr[{}]/td[1]/div/a[2]'.format(x,j)) # first bookmaker name
            Odd_1 = self.fffi('//*[@id="odds-data-table"]/div[{}]/table/tbody/tr[{}]/td[3]/div'.format(x,j)) # first home odd
            Odd_2 = self.fffi('//*[@id="odds-data-table"]/div[{}]/table/tbody/tr[{}]/td[4]/div'.format(x,j)) # first away odd
            match = self.ffi('//*[@id="col-content"]/h1') # match teams
            final_score = self.ffi('//*[@id="event-status"]')
            date = self.ffi('//*[@id="col-content"]/p[1]') # Date and time
            print(match, Book, Odd_1, Odd_2, date, final_score, link, over_under_type, '/ 500 ')
            L = L + [(match, Book, Odd_1, Odd_2, date, final_score, link, over_under_type)]
        data_df = pd.DataFrame(L)
        try:
            data_df.columns = ['TeamsRaw', 'Bookmaker', 'Over', 'Under', 'DateRaw' ,'ScoreRaw','Link','Over Under Type']
        except:
            print('Function crashed, probable reason : no games scraped (empty season)')
        master_df=pd.concat([master_df,data_df])
My issue is that this code takes something like 5 minutes per iteration to execute. I am now trying to make the program faster. I guess there might be a more elegant way to achieve this than all those for loops? I need them in order to get the correct "div" for each XPath. I would be glad for some recommendations!
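One direction I could imagine, as a rough and unmeasured sketch rather than a tested solution: read the rendered HTML once per match and parse the rows in memory, instead of issuing one XPath lookup per cell, and replace the if/elif chain with a dictionary lookup. This assumes the scraper is driven by Selenium, where `driver.page_source` would supply the HTML; here a trimmed copy of the row shown above stands in for it. The names `OVER_UNDER_BY_DIV`, `OddsRowParser` and `parse_odds_rows` are made up for illustration, and only the standard library is used.

```python
from html.parser import HTMLParser

# Lookup table replacing the if/elif chain: div index -> market label.
# The indices are copied from the loop in the question.
OVER_UNDER_BY_DIV = {
    1: 'Over/Under +0.5', 4: 'Over/Under +1', 6: 'Over/Under +1.5',
    8: 'Over/Under +1.75', 9: 'Over/Under +2', 10: 'Over/Under +2.25',
    11: 'Over/Under +2.5', 13: 'Over/Under +2.75', 14: 'Over/Under +3',
    16: 'Over/Under +3.5', 19: 'Over/Under +4', 21: 'Over/Under +4.5',
    26: 'Over/Under +5.5', 28: 'Over/Under +6.5',
}


class OddsRowParser(HTMLParser):
    """Hypothetical helper: collects the text of every <td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell texts per finished <tr>
        self._cells = None   # cells of the row currently being read
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cells = []
        elif tag == 'td' and self._cells is not None:
            self._in_td = True
            self._cells.append('')

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False
        elif tag == 'tr' and self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None


def parse_odds_rows(html, over_under_type):
    """Return (bookmaker, over, under, market) tuples for all odds rows in html."""
    parser = OddsRowParser()
    parser.feed(html)
    parser.close()
    # Assumed cell layout, as in the row above:
    # bookmaker, handicap, over, under, ...
    return [(cells[0], cells[2], cells[3], over_under_type)
            for cells in parser.rows if len(cells) >= 4]


# Trimmed copy of the row from the question, standing in for the page source.
SAMPLE = '''<tr class="lo odd">
<td><div class="l"><a class="name" href="/bookmaker/pinnacle/link/">Pinnacle</a> </div></td>
<td class="center">+0.5</td>
<td class="right odds"><div class=" deactivateOdd">1.10</div></td>
<td class="right odds up-dark"><div class=" deactivateOdd">7.85</div></td>
<td class="center info-value"><span>-</span></td>
</tr>'''

rows = parse_odds_rows(SAMPLE, OVER_UNDER_BY_DIV[1])
print(rows)
```

With something like this, the tuples for all markets of a match could be collected in one list and turned into a data frame once at the end, rather than calling `pd.concat` inside the loop. I have not measured whether this actually removes the bottleneck.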