
So this is what I currently have. This code makes about 5,000 calls to the NBA API and collects the total Games Played and Points Scored of every NBA player who has ever played in the playoffs. The players are all added to the 'stats_dict' dictionary, with names as keys and stats as values.

MY QUESTION IS THIS: does anybody know how I could significantly increase the speed of this process by using threading? Right now, it takes about 30 minutes to make all these API calls, which of course I would love to significantly improve upon. I've never used threads before and would appreciate any guidance.

Thanks

import pandas as pd
from nba_api.stats.endpoints import commonallplayers, playercareerstats

# Fetch the full player list once and index it by player ID.
player_data = commonallplayers.CommonAllPlayers(timeout=30)
player_df = player_data.common_all_players.get_data_frame().set_index('PERSON_ID')

id_list = player_df.index.tolist()


def playoff_stats(person_id):
    # One API call per player; returns [[GP, PTS]] for players with
    # playoff totals, or an empty list otherwise.
    player_stats = playercareerstats.PlayerCareerStats(person_id, timeout=30)
    return player_stats.career_totals_post_season.get_data_frame()[['GP', 'PTS']].values.tolist()


stats_dict = {}


def run_it():
    for i in id_list:
        try:
            stats_call = playoff_stats(i)

            if len(stats_call) > 0:
                stats_dict[player_df.loc[i, 'DISPLAY_FIRST_LAST']] = [stats_call[0][0], stats_call[0][1]]

        except KeyError:
            continue


run_it()

2 Answers


You're asking the wrong question. The real question is: why is my program taking 30 minutes?

In other words, where is my program spending time? What is it doing that's taking so long?

You can speed up a program by using threads ONLY if these two things are true:

  • The program is spending a significant fraction of its time waiting on some external resource (the internet or a printer, for example)
  • There is something useful that it could do in another thread while it's waiting

It is far from clear whether both of those things are true in your case.

Check out the time module in the standard Python library. If you go through your code and insert print(time.time()) statements at critical points, you will quickly see where the program is spending its time. Until you figure that out, you might be totally wasting your effort by writing a threaded version.
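A minimal sketch of that approach, with `time.sleep` standing in for whatever slow operation (such as one of your API calls) you want to measure:

```python
import time

start = time.time()
# ... the slow operation you want to measure goes here;
# time.sleep is just a stand-in for one real API call.
time.sleep(0.25)
elapsed = time.time() - start
print(f"call took {elapsed:.2f}s")
```

Wrapping each suspect section like this will quickly show you which part dominates the 30 minutes.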

By the way, there are more sophisticated ways to get a handle on a program's performance, but your program is so incredibly slow that a few simple print statements should point you toward a better understanding.

Paul Cornelius
  • Okay -- thanks. I do profess my ignorance (clearly) about what's going on "under the hood" most of the time. My question to you would be this: simply making the API call ONCE (for one specific player) takes about 0.25 seconds. Doing this 4602 times, then, would take about 19 minutes. So maybe my code isn't 'incredibly slow' (or maybe it is, just explaining the situation as I see it) and the amount of time my program takes actually makes sense for the number of API calls I'm making?? – Michael Black Jun 29 '21 at 00:22
  • Nonetheless I appreciate the feedback and will insert some time statements throughout to see what I can improve on – Michael Black Jun 29 '21 at 00:27
  • Sorry I didn't mean to imply that there was anything wrong with your code. Perhaps I should have said that your "program" was slow. Where is the data coming from? Even if it's being downloaded from the internet, 0.25s still seems rather a long time for data that is probably just some numbers. Maybe you're grabbing a lot more data than you really need. If the API is well documented you might find some ideas about how to deal with that. (I'm just guessing here...) – Paul Cornelius Jun 29 '21 at 00:37

Firstly, as others have mentioned, your program is not particularly optimized, and figuring out where the time goes should be your number one step. I would recommend debugging it with some print statements or by measuring run time (How to measure time taken between lines of code in python?).

Another possible solution, a little more brute force, is concurrent.futures. It can help you run many calls at once, but once again it won't matter if your code isn't optimized: you'll just be running unoptimized code a lot of times in parallel.
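A minimal sketch of the concurrent.futures idea, with a dummy `fetch` function standing in for one real nba_api call (the names here are illustrative, not from your code):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def fetch(player_id):
    # Stand-in for one API call; the real version would hit the network
    # and spend most of its time waiting on I/O, which is exactly when
    # threads help.
    time.sleep(0.1)
    return player_id, [10, 100]  # e.g. (id, [GP, PTS])

ids = range(20)
results = {}

# Ten workers run up to ten "calls" at once instead of one after another.
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(fetch, i): i for i in ids}
    for fut in as_completed(futures):
        player_id, stats = fut.result()
        results[player_id] = stats
```

Note that a real API may rate-limit or drop concurrent requests, so you would likely need to keep `max_workers` modest and handle timeouts per call.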

This link is for web scraping, but it might be helpful.