If I perform a code search using the GitHub Search API and request 100 results per page, I get a varying number of results:

import requests

# The size-restricted variant used later in the question:
# url = "https://api.github.com/search/code?q=torch+in:file+language:python+size:0..250&page=1&per_page=100"
url = "https://api.github.com/search/code?q=torch+in:file+language:python&page=1&per_page=100"

headers = {
    'Authorization': 'Token xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
}

response = requests.get(url, headers=headers).json()

print(len(response['items']))

Thanks to this answer, I have the following workaround: I run the query multiple times until I get the required results on a page.
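In case it helps, a minimal sketch of that retry workaround (the function name, the default of 100, and the retry cap are my own illustrative choices, not from the linked answer):

import requests

def fetch_full_page(url, headers, expected=100, max_tries=5):
    # Re-issue the identical request until the page contains the
    # expected number of items, or give up after max_tries attempts.
    data = {}
    for _ in range(max_tries):
        data = requests.get(url, headers=headers).json()
        if len(data.get("items", [])) >= expected:
            break
    return data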

My current project requires me to iterate through the search API looking for files of varying sizes, basically repeating the procedure described here. My code therefore looks something like this:

url = "https://api.github.com/search/code?q=torch +in:file + language:python+size:0..250&page=1&per_page=100"

In this case, I don't know in advance the number of results a page should actually have. Could someone tell me a workaround for this? Maybe I am using the Search API incorrectly?

  • I think the appropriate behavior is to use the pagination api and request subsequent pages until you have the number of results you want (or until there are no more results). – larsks Jan 03 '23 at 02:51
  • Thank you! I am thoroughly confused with GitHub's API documentation. Is there something I should learn to become more acquainted with this system? Perhaps read more about GraphQL and the REST API? – desert_ranger Jan 04 '23 at 00:51

1 Answer

GitHub provides documentation about Using pagination in the REST API. Each response includes a Link header that includes a link for the next set of results (along with other links); you can use this to iterate over the complete result set.
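For reference, the raw header looks roughly like this (URLs abbreviated here; the exact values vary with the query):

Link: <https://api.github.com/search/code?q=...&page=2>; rel="next", <https://api.github.com/search/code?q=...&page=34>; rel="last"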

For the particular search you're doing ("every Python file that contains the word 'torch'"), you're going to run into rate limits fairly quickly, but as an example, the following code iterates over results roughly 10 at a time until 50 or more results have been read:

import os
import requests
import httplink

url = "https://api.github.com/search/code?q=torch +in:file + language:python&page=1&per_page=10"

headers = {"Authorization": f'Token {os.environ["GITHUB_TOKEN"]}'}

# This is the total number of items we want to fetch
max_items = 50

# This is how many items we've retrieved so far
total = 0

try:
    while True:
        res = requests.get(url, headers=headers)
        res.raise_for_status()

        # The Link header may be absent on the last page of results,
        # so fall back to an empty string instead of raising a KeyError.
        link = httplink.parse_link_header(res.headers.get("link", ""))

        data = res.json()

        # Number the printed results continuously across pages.
        for i, item in enumerate(data["items"], start=total):
            print(f'[{i}] {item["html_url"]}')

        # Stop when the API reports no further pages...
        if "next" not in link:
            break

        # ...or once at least max_items results have been read.
        total += len(data["items"])
        if total >= max_items:
            break

        # Follow the "next" link to fetch the following page.
        url = link["next"].target
except requests.exceptions.HTTPError as err:
    print(err)
    print(err.response.text)

Here I'm using the httplink module to parse the Link header, but you could accomplish the same thing with an appropriate regular expression and the re module.
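If you'd rather not pull in a dependency, a rough equivalent using re might look like this (my own sketch, assuming the simple <url>; rel="name" format GitHub emits):

import re

def parse_links(value):
    # Return a {rel: url} mapping, e.g. {"next": "...", "last": "..."}.
    return {
        rel: url
        for url, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]+)"', value)
    }

With that, the calls above would become something like links = parse_links(res.headers.get("link", "")), with links["next"] in place of link["next"].target.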

larsks