
I have this code that extracts GitHub user info with the GitHub API and inserts it into a MySQL table, but the process is extremely slow: nearly 30-40 minutes for a single user, and that is only for the repo info I want. I also want to add info about commits, which may take a day. Please suggest what I should do to increase the speed of the code and of the inserts into the table. The code is long, sorry for that, but I want to show the whole thing so I can get suggestions about the overall code.

import datetime
import os

import mysql.connector
import requests

connection = mysql.connector.connect(host='host',
                                     user='root',
                                     password='pas',
                                     database='db')

cursor = connection.cursor()

#user name--------------------------------------------------------------------------
username = 'waldekmastykarz'
#-----------------------------------------------------------------------------------
token = os.getenv('GITHUB_TOKEN', '-----------githubtoken---------')
params = {}
headers = {'Authorization': f'token {token}'}
data = requests.get('https://api.github.com/users/' +
                    username, headers=headers, params=params)
data = data.json()


#initial details ----------------------------------------------------------------
initial_details = {}
initial_details['username'] = username
initial_details['name'] = data['name']
initial_details['email'] = data['email']
initial_details['id'] = data['id']
initial_details['location'] = data['location']
initial_details['followers'] = data['followers']
initial_details['following'] = data['following']


#total details------------------------------------------------------------------
Total_commits_of_user = 0
total_repos = 0
total_stars = 0


#total repos -------------------------------------------------------------------
url = data['repos_url']
page_no = 1
repos_data = []
while True:
    response = requests.get(url, headers=headers, params={"per_page": 100})
    response = response.json()
    repos_data = repos_data + response
    total_repos = len(response) + total_repos
    repos_fetched = len(response)
    if repos_fetched == 100:
        page_no = page_no + 1
        url = data['repos_url'] + '?page=' + str(page_no)
    else:
        break

print(total_repos)

#get all repos info ---------------------------------------------------------------
all_repo_data = {}
for repo in repos_data:
    name = repo['name']
    repo_info = all_repo_data[name] = {}
    repo_info['name'] = name
    repo_info['id'] = repo['id']
    repo_info['description'] = repo['description']
    repo_info['created_at'] = repo['created_at']
    repo_info['updated_at'] = repo['updated_at']
    repo_info['owner'] = repo['owner']['login']
    repo_info['stars'] = repo['stargazers_count']
    repo_info['url'] = repo['url']
    repo_info['commits_url'] = repo['commits_url'].split("{")[0]
    repo_info['languages'] = repo['url'] + '/languages'
    repo_info['contributors'] = repo['contributors_url']

    #total stars---------------------------------------------------------------------------------------
    total_stars = repo_info['stars'] + total_stars

    #correct format for date time ----------------------------------------------------------------------
    d1 = datetime.datetime.strptime(repo_info['updated_at'], "%Y-%m-%dT%H:%M:%SZ")
    d2 = datetime.datetime.strptime(repo_info['created_at'], "%Y-%m-%dT%H:%M:%SZ")
    repo_info['created_at'] = d2.date()
    repo_info['updated_at'] = d1.date()


    #total commits----------------------------------------------------------------------------------------------
    total_commit = 0
    url = repo_info['commits_url']
    page_no = 1
    while True:
        response = requests.get(url, headers=headers, params={"per_page": 100})
        response = response.json()
        total_commit = len(response) + total_commit
        commit_fetched = len(response)
        if commit_fetched == 100:
            page_no = page_no + 1
            url = repo_info['commits_url'] + '?page=' + str(page_no)
        else:
            break
    #total commits -------------------------------------------------------------------------------------
    repo_info['total_commits'] = total_commit


    # commit_by_user ----------------------------------------------------------------------------------------
    commit_by_user = 0
    url = repo_info['commits_url']
    page_no = 1
    while True:
        response = requests.get(url, headers=headers,
                                params={"author": username, "per_page": 100})
        response = response.json()
        commit_by_user = len(response) + commit_by_user
        commit_by_user_fetched = len(response)
        if commit_by_user_fetched == 100:
            page_no = page_no + 1
            url = repo_info['commits_url'] + '?page=' + str(page_no)
        else:
            break
    #total commits by user---------------------------------------------------------------------------------
    repo_info['commit_by_user'] = commit_by_user
    Total_commits_of_user = commit_by_user + Total_commits_of_user


    #contributors----------------------------------------------------------------------------------------------
    total_contributors = 0
    url = repo_info['contributors']
    page_no = 1
    while True:
        response = requests.get(url, headers=headers, params={"per_page": 100})
        response = response.json()
        total_contributors = len(response) + total_contributors
        contributors_fetched = len(response)
        if contributors_fetched == 100:
            page_no = page_no + 1
            url = repo_info['contributors'] + '?page=' + str(page_no)
        else:
            break
    #total contributors---------------------------------------------------------------------------------------
    repo_info['contributors'] = total_contributors


    ##commit dates-------------------------------------------------------------------------------------------------
    url = repo_info['commits_url']
    page_no = 1
    while True:
        response = requests.get(url, headers=headers, params={"per_page": 100})
        response = response.json()
        for commits in response:
            d1 = datetime.datetime.strptime(
                commits['commit']['author']['date'], "%Y-%m-%dT%H:%M:%SZ")
            # print(repo_info['name'], d1.date())
        # the page count must be set outside the for loop; setting it inside
        # means an empty page reuses the stale value and loops forever
        commit_fetched = len(response)
        if commit_fetched == 100:
            page_no = page_no + 1
            url = repo_info['commits_url'] + '?page=' + str(page_no)
        else:
            break

    sql = "INSERT INTO user_repo (user_id, github_id, repo_id, user_connection, repo_name, contributors, total_commits, commit_by_user, created_date, last_commit) VALUES (  %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    cursor.execute(sql, (
                initial_details['id'],
                username,
                all_repo_data[(repo['name'])]['id'],
                all_repo_data[(repo['name'])]['owner'],
                all_repo_data[(repo['name'])]['name'],
                all_repo_data[(repo['name'])]['contributors'],
                all_repo_data[(repo['name'])]['total_commits'],
                all_repo_data[(repo['name'])]['commit_by_user'],
                all_repo_data[(repo['name'])]['created_at'],
                all_repo_data[(repo['name'])]['updated_at'],
            ))
        
connection.commit()


zeak
  • I am not sure what your bottleneck is. If you have a huge amount of `insert`s you should consider using `TRANSACTION`s. See: https://stackoverflow.com/a/7978511/2681662 – MSH Oct 14 '21 at 06:46
  • @MSH it is a huge amount of inserts – zeak Oct 14 '21 at 06:48
  • This doesn't look like a MySQL issue, albeit that it could be improved in that respect. Obviously, without the access token it's impossible for anyone other than yourself to run this, but my best guess is that the performance is degraded by the very large number of GET requests. Try printing the time (actual time of day) before you enter each of the many *while True* blocks. That might give you a clue. Then you could consider shifting those blocks of code to discrete processes that could run concurrently – Oct 14 '21 at 07:10
  • @BrutusForcus can you tell me how? – zeak Oct 14 '21 at 07:49
  • Without actually rewriting it for you, the best I can do is refer you to: https://docs.python.org/3.9/library/multiprocessing.html –  Oct 14 '21 at 08:04
  • Please trim your code to make it easier to find your problem. Follow these guidelines to create a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Community Oct 15 '21 at 10:55
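
To illustrate the concurrency suggestion from the comments: a minimal sketch that runs the per-repo commit counting concurrently, using threads from `concurrent.futures` rather than the `multiprocessing` module linked above (threads are usually sufficient here, since the work is I/O-bound). It assumes the `headers` and `repos_data` variables from the question's code:

from concurrent.futures import ThreadPoolExecutor

import requests

def count_commits(repo):
    # count commits for one repo; each call runs in its own worker thread
    # (assumes `headers` from the question's code is in scope)
    commits_url = repo['commits_url'].split("{")[0]
    total = 0
    page_no = 1
    while True:
        response = requests.get(commits_url, headers=headers,
                                params={"per_page": 100, "page": page_no})
        batch = response.json()
        total = total + len(batch)
        if len(batch) < 100:
            return repo['name'], total
        page_no = page_no + 1

# fetch several repos' commit pages at the same time instead of one after another
with ThreadPoolExecutor(max_workers=8) as pool:
    for name, commits in pool.map(count_commits, repos_data):
        print(name, commits)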

1 Answer


There are two possible solutions:

Transactions

MySQL has a flag that commits your queries automatically: whenever you send an insert/update/delete query, it is committed immediately. This slows things down. What you should do is disable autocommit:

SET autocommit = 0;

Now you send all the inserts and commit them at the end, once all your queries are done, like so:

START TRANSACTION;
INSERT INTO `the_table` (field_1, field_2, ..., field_n) VALUES (value_1, value_2, ..., value_n);
INSERT INTO `the_table` (field_1, field_2, ..., field_n) VALUES (value_1, value_2, ..., value_n);
...
INSERT INTO `the_table` (field_1, field_2, ..., field_n) VALUES (value_1, value_2, ..., value_n);
COMMIT;

See: https://stackoverflow.com/a/7978511/2681662
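
Applied to the code in the question, a minimal sketch of the same idea with `mysql.connector` might look like this. The table and column names are taken from the question; `rows` is a hypothetical list standing in for the per-repo tuples currently passed to `cursor.execute` one at a time:

import mysql.connector

connection = mysql.connector.connect(host='host', user='root',
                                     password='pas', database='db')
connection.autocommit = False  # mysql.connector leaves autocommit off by default; being explicit
cursor = connection.cursor()

sql = ("INSERT INTO user_repo (user_id, github_id, repo_id, user_connection, repo_name, "
       "contributors, total_commits, commit_by_user, created_date, last_commit) "
       "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")

rows = []  # hypothetical: one tuple per repo, built the same way as in the question's loop

# all inserts run inside one transaction and are committed together at the end
for row in rows:
    cursor.execute(sql, row)
connection.commit()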

Multiple inserts

Inserting multiple rows at once is another option.

INSERT INTO `the_table`
    (field_1, field_2, ..., field_n)
VALUES
    (value_1, value_2, ..., value_n),
    (value_1, value_2, ..., value_n),
    ...
    (value_1, value_2, ..., value_n);

See: https://stackoverflow.com/a/6889087/2681662

However, this has its own limitations, such as the maximum packet size (`max_allowed_packet`) the server accepts.
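
With `mysql.connector` you don't have to build the multi-row statement by hand: for plain INSERT statements, `cursor.executemany()` batches all the parameter tuples into a single multi-row INSERT. A sketch under the same assumptions as above, with the column list trimmed and example rows purely for illustration:

import mysql.connector

connection = mysql.connector.connect(host='host', user='root',
                                     password='pas', database='db')
cursor = connection.cursor()

# column list trimmed to two fields purely for illustration
sql = "INSERT INTO user_repo (user_id, repo_name) VALUES (%s, %s)"
rows = [(1234, 'repo-a'), (1234, 'repo-b'), (1234, 'repo-c')]  # example data

# executemany() rewrites this into one multi-row INSERT, so all rows
# reach the server in a single statement instead of one round trip per row
cursor.executemany(sql, rows)
connection.commit()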

MSH