This is my C#-based idea of how this issue can be approached; it doesn't solve it completely, but it improves performance. The code below solves the problem of retrieving all commits on the default branch of a repository, but it can be applied to almost any cursor-based pagination scenario in the GitHub GraphQL API. I'm aware that your question concerns "all commits of all branches, deduplicated"; still, I believe this approach may be useful to you as well.
The inherent problem with querying a large repository is the limit of 100 results per page, combined with having to iterate over the pages one by one, since each page contains the cursor to the next. My solution works out the page cursors up front and, by sending all page requests concurrently, reduces the overall execution time.
The idea is to make an initial request to the GitHub GraphQL API that fetches only the total count for the given filters; I assume we fetch 100 results per page. GitHub commit page cursors always have the format "xX9XXXXXXX3961722145Xf39cc9617XXXXxxx 99": the first part is the oid of the first commit of the first page (all cursors on all pages use this same oid; it doesn't change while iterating), and the number after the space is the zero-based index of the last commit of the previous page. This makes it easy to calculate the cursors for each page of, say, a 670-commit repository from the "totalCount" request alone; at 100 results per page that is ceil(670 / 100) = 7 pages, and hence these 7 cursors:
- null
- "xX9XXXXXXX3961722145Xf39cc9617XXXXxxx 99"
- "xX9XXXXXXX3961722145Xf39cc9617XXXXxxx 199"
- "xX9XXXXXXX3961722145Xf39cc9617XXXXxxx 299"
- "xX9XXXXXXX3961722145Xf39cc9617XXXXxxx 399"
- "xX9XXXXXXX3961722145Xf39cc9617XXXXxxx 499"
- "xX9XXXXXXX3961722145Xf39cc9617XXXXxxx 599"
After generating the cursors that identify the beginning of each page, we can prepare a separate Task per page, where each Task contains a request to GitHub GraphQL that fetches one page, and then use Task.WhenAll to execute them all.
I've tested this on a repository with 670 commits, and all 7 pages are fetched in around 7 seconds total. Iterating through the pages one by one instead takes around 4 seconds per page, which adds up to 25-30 seconds.
It should be noted that this wasn't tested in a production environment, that it doesn't cover error handling, and that the parallelism/concurrency implementation can most probably be improved, so it should be viewed only as a proof of concept. Additionally, I'm not sure how the GitHub API will react when you fire off concurrent requests for repositories with 100 or 1,000 pages of commits.
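If that request volume does turn out to be a problem, one option would be to cap how many page requests are in flight at once. Here is a minimal, untested sketch of that idea: GetPagesThrottledAsync is a hypothetical helper of mine, the cap of 10 is an arbitrary guess rather than any documented GitHub limit, and it reuses the GetDefaultBranchCommitsPageByPeriodAsync method shown further down.

private async Task<List<Commit>> GetPagesThrottledAsync(List<string> cursors, DateTime since, string repositoryOwner, string repositoryName)
{
    // SemaphoreSlim gates how many page requests run at the same time
    using var gate = new SemaphoreSlim(10); // arbitrary cap, tune as needed
    var tasks = cursors.Select(async cursor =>
    {
        await gate.WaitAsync();
        try
        {
            // Same per-page request as in GetCommitsByPeriodAsync below
            return await GetDefaultBranchCommitsPageByPeriodAsync(since, cursor, repositoryOwner, repositoryName);
        }
        finally
        {
            gate.Release();
        }
    });
    var results = await Task.WhenAll(tasks);
    return results.SelectMany(x => x.Commits).ToList();
}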
public async Task<List<Commit>> GetCommitsByPeriodAsync(Guid integrationId, DateTime since, string repositoryName, string repositoryOwner)
{
    string initialCursor = null;
    // One request up front to get the total count and the first page's end cursor
    var firstPageInfo = await GetDefaultBranchCommitsFirstPageInfoAsync(since, initialCursor, repositoryOwner, repositoryName);
    // Derive every page cursor from the total count alone, without walking the pages
    var commitPagesCursors = GetCommitPagesCursors(firstPageInfo, initialCursor);
    // One task per page, all fired concurrently
    var tasks = commitPagesCursors.Select(x => GetDefaultBranchCommitsPageByPeriodAsync(since, x, repositoryOwner, repositoryName));
    var results = await Task.WhenAll(tasks);
    var branchCommitsByPeriod = results.SelectMany(x => x.Commits)
        .ToList();
    return branchCommitsByPeriod;
}
private List<string> GetCommitPagesCursors(GetCommitsPageInfoResponse firstPageInfo, string initialCursor)
{
    // No commits in the period: the first (empty) page is the only page
    if (firstPageInfo.PageInfo.EndCursor == null)
    {
        return new List<string> { initialCursor };
    }
    // The first two cursors are always null and "oid 99" for 100-item pages
    var cursors = new List<string> { initialCursor, firstPageInfo.PageInfo.EndCursor };
    int totalCount = firstPageInfo.TotalCount;
    var firstCommitCursorSplit = firstPageInfo.PageInfo.EndCursor.Split(" ");
    var firstCommitId = firstCommitCursorSplit[0];
    var lastPageCommitNumberString = firstCommitCursorSplit[1];
    if (!int.TryParse(lastPageCommitNumberString, out int lastPageCommitNumber))
    {
        throw new FormatException($"Unexpected cursor format: '{firstPageInfo.PageInfo.EndCursor}'");
    }
    // 100 is the max number of objects in a page
    lastPageCommitNumber += 100;
    while (lastPageCommitNumber < totalCount)
    {
        string nextPageCursor = $"{firstCommitId} {lastPageCommitNumber}";
        cursors.Add(nextPageCursor);
        lastPageCommitNumber += 100;
    }
    return cursors;
}
public async Task<GetCommitsPageInfoResponse> GetDefaultBranchCommitsFirstPageInfoAsync(DateTime since, string cursor, string repositoryOwner, string repositoryName)
{
    // Code omitted for brevity

    // Fetches only the total count and the first page's end cursor; no commit data yet
    var commitsRequest = new GraphQLRequest
    {
        Query = @"
            query GetCommitsFirstPage($cursor: String, $commitsSince: GitTimestamp!, $repositoryName: String!, $repositoryOwner: String!) {
              repository(name: $repositoryName, owner: $repositoryOwner) {
                defaultBranchRef {
                  target {
                    ... on Commit {
                      history(after: $cursor, since: $commitsSince) {
                        totalCount
                        pageInfo {
                          endCursor
                          hasNextPage
                        }
                      }
                    }
                  }
                }
              }
            }",
        OperationName = "GetCommitsFirstPage",
        Variables = new
        {
            commitsSince = since.ToString("o"),
            cursor = cursor,
            repositoryOwner = repositoryOwner,
            repositoryName = repositoryName
        }
    };

    // Code omitted for brevity
}
public async Task<GetCommitsPageResponse> GetDefaultBranchCommitsPageByPeriodAsync(DateTime since, string cursor, string repositoryOwner, string repositoryName)
{
    // Code omitted for brevity

    // Fetches a single page of up to 100 commits starting after the given cursor
    var commitsRequest = new GraphQLRequest
    {
        Query = @"
            query GetCommitsSinceTimestamp($cursor: String, $commitsSince: GitTimestamp!, $repositoryName: String!, $repositoryOwner: String!) {
              repository(name: $repositoryName, owner: $repositoryOwner) {
                defaultBranchRef {
                  target {
                    ... on Commit {
                      history(after: $cursor, since: $commitsSince) {
                        pageInfo {
                          endCursor
                          hasNextPage
                        }
                        edges {
                          node {
                            oid
                            additions
                            deletions
                            commitUrl
                            url
                            committedDate
                            associatedPullRequests(first: 10) {
                              nodes {
                                id
                                mergedAt
                              }
                            }
                            repository {
                              databaseId
                              nameWithOwner
                            }
                            author {
                              name
                              email
                              user {
                                login
                              }
                            }
                            message
                          }
                        }
                      }
                    }
                  }
                }
              }
            }",
        OperationName = "GetCommitsSinceTimestamp",
        Variables = new
        {
            commitsSince = since.ToString("o"),
            cursor = cursor,
            repositoryOwner = repositoryOwner,
            repositoryName = repositoryName
        }
    };

    // Code omitted for brevity
}
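For completeness, a hypothetical call site. The commitService instance, the argument values, and the Commit.Oid property are assumptions of mine for illustration (the query above does select oid, but I don't know your exact response model), and DistinctBy requires .NET 6+. The last line hints at how results gathered per branch could be merged for the "all branches, deduplicated" case from your question:

// Hypothetical usage; commitService is an instance of the class containing the methods above
var commits = await commitService.GetCommitsByPeriodAsync(
    integrationId: Guid.NewGuid(), // placeholder value
    since: DateTime.UtcNow.AddDays(-30),
    repositoryName: "my-repo",
    repositoryOwner: "my-org");

// If the same pagination is run once per branch, the combined results could be
// deduplicated by commit oid (assumes the Commit model exposes the oid from the query)
var deduplicated = commits.DistinctBy(c => c.Oid).ToList();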