
I have written the following code to scrape data from a website (e.g. https://www.oddsportal.com/soccer/new-zealand/football-championship/hamilton-canterbury-GhUEDiE0/). The data in question are the over/under values, which can be found in the page's HTML:

            <tr class="lo odd">
                <td>
                    <div class="l"><a class="name2" title="Go to Pinnacle website!" onclick="return !window.open(this.href)" href="/bookmaker/pinnacle/link/"><span class="blogos l18"></span></a>&nbsp;<a class="name" title="Go to Pinnacle website!" onclick="return !window.open(this.href)"
                            href="/bookmaker/pinnacle/link/">Pinnacle</a>&nbsp;&nbsp;</div><span class="ico-bookmarker-info ico-bookmaker-detail"><a title="Show more details about Pinnacle" href="/bookmaker/pinnacle/"></a></span></td>
                <td class="center">+0.5</td>
                <td class="right odds">
                    <div class=" deactivateOdd" onmouseout="delayHideTip()" onmouseover="page.hist(this,'P-0.50-0-0','4j5hgx1tkucx1ix0',18,event,0,1)">1.10</div>
                </td>
                <td class="right odds up-dark">
                    <div class=" deactivateOdd" onmouseout="delayHideTip()" onmouseover="page.hist(this,'P-0.50-0-0','4j5hgx1tl1gx1ix0',18,event,0,1)">7.85</div>
                </td>
                <td class="center info-value"><span>-</span></td>
                <td onmouseout="delayHideTip()" class="check ch1" xparam="The match has already started~2"></td>
            </tr>

The parts of interest are the over/under odds, here for example 1.10 and 7.85. These values are scraped and arranged in a DataFrame:

    master_df= pd.DataFrame()

    for match in self.all_links:
    #for match in links:

        self.openmatch(match)
        self.clickou()
        self.expandodds()   
        for x in range(1, 29): # range end is exclusive; 29 is needed to reach div 28
            L = []
            bookmakers=['Asianodds','Pinnacle']

                #odds_type=fi2('//*[@id="odds-data-table"]/div{}/div/strong/a'.format(x))
            if x==1:
                over_under_type= 'Over/Under +0.5'
            elif x==4:
                over_under_type= 'Over/Under +1'
            elif x==6:
                over_under_type= 'Over/Under +1.5'
            elif x==8:
                over_under_type= 'Over/Under +1.75'
            elif x==9:
                over_under_type= 'Over/Under +2'  
            elif x==10:
                over_under_type= 'Over/Under +2.25'
            elif x==11:
                over_under_type= 'Over/Under +2.5'
            elif x==13:
                over_under_type= 'Over/Under +2.75'
            elif x==14:
                over_under_type= 'Over/Under +3' 
            elif x==16:
                over_under_type= 'Over/Under +3.5'  
            elif x==19:
                over_under_type= 'Over/Under +4'
            elif x==21:
                over_under_type= 'Over/Under +4.5'
            elif x==26:
                over_under_type= 'Over/Under +5.5'
            elif x==28:
                over_under_type= 'Over/Under +6.5' 

            for j in range(1, 15): # bookmaker rows 1-14
                Book = self.ffi('//*[@id="odds-data-table"]/div[{}]/table/tbody/tr[{}]/td[1]/div/a[2]'.format(x,j)) # first bookmaker name
                Odd_1 = self.fffi('//*[@id="odds-data-table"]/div[{}]/table/tbody/tr[{}]/td[3]/div'.format(x,j)) # first home odd
                Odd_2 = self.fffi('//*[@id="odds-data-table"]/div[{}]/table/tbody/tr[{}]/td[4]/div'.format(x,j)) # first away odd
                match = self.ffi('//*[@id="col-content"]/h1') # match teams
                final_score = self.ffi('//*[@id="event-status"]')
                date = self.ffi('//*[@id="col-content"]/p[1]') # Date and time
                print(match, Book, Odd_1, Odd_2, date, final_score, link, over_under_type, '/ 500 ')
                L = L + [(match, Book, Odd_1, Odd_2, date, final_score, link, over_under_type)]
                data_df = pd.DataFrame(L)

                try:
                    data_df.columns = ['TeamsRaw', 'Bookmaker', 'Over', 'Under', 'DateRaw', 'ScoreRaw', 'Link', 'Over Under Type']
                except ValueError: # column count mismatch when nothing was scraped
                    print('Function crashed, probable reason: no games scraped (empty season)')
                master_df=pd.concat([master_df,data_df])

My issue is that each iteration of this code takes something like 5 minutes to execute, and I am now trying to make the program more performant. I suspect there is a more elegant way to achieve this than all those for loops, but I need them in order to build the correct "div" index for each XPath. I would be glad for some recommendations!
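(Editor's note: a minimal sketch. The long `elif` ladder in the code above is a static mapping from the div index `x` to a label, so it can be written as a single dict lookup; the index-to-label pairs below are copied verbatim from that chain. One behavioral difference to be aware of: the original leaves `over_under_type` unchanged for indices with no branch, while `.get` returns `None`.)

    # map the div index x to its over/under label; pairs taken from the elif chain
    OU_LABELS = {
        1: 'Over/Under +0.5',  4: 'Over/Under +1',    6: 'Over/Under +1.5',
        8: 'Over/Under +1.75', 9: 'Over/Under +2',   10: 'Over/Under +2.25',
        11: 'Over/Under +2.5', 13: 'Over/Under +2.75', 14: 'Over/Under +3',
        16: 'Over/Under +3.5', 19: 'Over/Under +4',  21: 'Over/Under +4.5',
        26: 'Over/Under +5.5', 28: 'Over/Under +6.5',
    }

    over_under_type = OU_LABELS.get(11)  # 'Over/Under +2.5'
    unlabeled = OU_LABELS.get(3)         # None: div 3 has no label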

  • Hi. Could you share data so we can work on the same base and see how we could improve it? – Lumber Jack Jan 31 '21 at 16:51
  • Fair enough, I have added the HTML code that I want to scrape and format – BlackElefant Jan 31 '21 at 16:59
  • Just to understand: I see the HTML you want to scrape, and I understand you want to speed up your code, but you haven't posted all of your code, so we can't reproduce it, correct? – Lumber Jack Jan 31 '21 at 17:22
  • Correct. I did not include the functions where the driver is set up, the website is opened, etc., as I thought they would not be relevant – BlackElefant Jan 31 '21 at 17:36
  • Culprit could be `master_df=pd.concat([master_df,data_df])`, as discussed and solved in the post [Why does concatenation of DataFrames get exponentially slower?](https://stackoverflow.com/questions/36489576/why-does-concatenation-of-dataframes-get-exponentially-slower). Advice is: **Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.** – DarrylG Jan 31 '21 at 18:03
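(Editor's note: a minimal sketch of the pattern that last comment recommends: collect rows in a plain Python list and build the DataFrame once at the end, instead of calling `pd.concat` on every iteration. The tuple values below are placeholders standing in for the scraped fields.)

    import pandas as pd

    rows = []  # appending to a Python list is O(1); nothing is copied
    for j in range(3):  # stand-in for the scraping loops
        # placeholder values standing in for the scraped fields
        rows.append(("Hamilton - Canterbury", "Pinnacle", "1.10", "7.85",
                     "31 Jan 2021", "0:1", "https://example.com", "Over/Under +0.5"))

    # build the DataFrame once, after all rows have been collected
    master_df = pd.DataFrame(rows, columns=[
        "TeamsRaw", "Bookmaker", "Over", "Under",
        "DateRaw", "ScoreRaw", "Link", "Over Under Type",
    ])
    print(master_df.shape)  # (3, 8)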

1 Answer


I would recommend profiling your code to see where the bottlenecks are. cProfile is one I typically use.
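
A minimal, self-contained sketch of that workflow: profile a deliberately quadratic function (mirroring the `L = L + [...]` pattern in the question, which copies the whole list on every iteration) and print the top entries sorted by cumulative time. The function name `build_rows` is just a stand-in for the scraping loop.

    import cProfile
    import io
    import pstats

    def build_rows(n):
        rows = []
        for i in range(n):
            rows = rows + [i]  # quadratic: copies the entire list each time
        return rows

    profiler = cProfile.Profile()
    profiler.enable()
    build_rows(5000)
    profiler.disable()

    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
    print(buf.getvalue())  # 'build_rows' shows up near the top of the report

Running this against the real scraper (wrap one iteration of the match loop) will show whether the time goes into the Selenium lookups or into the DataFrame concatenation.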