How to fetch a GitHub repository for a specific commit efficiently?

Question

I'm trying to build a continuous integration system. Each push to GitHub will trigger a build.

Each build will need to check out or download the repository at the commit it's processing. I'm trying to find a way to do this that does not take minutes on large repositories (because the build itself takes only a few seconds…).

Please note that I do not want to store data between builds (that removes the possibility of caching).

The solutions I've explored:

git clone followed by a checkout of the commit: works, but takes minutes for large repositories
git 2.5 supposedly introduced a way to fetch a single commit, but I cannot get it to work with GitHub; my guess is that they are not using git 2.5 (Edit: indeed, it doesn't work with GitHub)
use the GitHub API for git data, but I cannot figure out whether I can download all files at a revision efficiently (i.e. without one HTTP request per file) (Edit: it seems GitHub allows downloading files as a "tree" - not sure what that means - but for large repositories the HTTP responses are truncated and they recommend simply using git… back to square one)

Every other solution I see on GitHub assumes either a recent git version on the server, or that it's fine to clone the repository once but in my case it's not. I'm starting from scratch on every build (because that's a constraint).

So I'm asking in the specific case of GitHub: how can I download (in any way) the code at a specific commit to be able to run continuous integration tools on that commit?

@JoshLee Thank you! At least I know why it doesn't work on GitHub, I wasn't sure if I was doing something wrong :) — Matthieu Napoli, Aug 12 '17 at 20:12

score 7 · Accepted Answer · answered Aug 12 '17 at 20:26

You can download an archive of a particular commit from GitHub using a URL of the form:

https://github.com/PROJECT/REPO/archive/COMMITID.zip

For example, if I have a project named "dockerize" and I want to download commit id 169532e I can run:

curl -OL https://github.com/larsks/dockerize/archive/169532e.zip

I've used a short commit ID here, but you can use a long commit ID, or a branch, or a tag, etc.

This will give me a .zip archive with the files from that particular commit. The top-level directory will be named REPO-LONGCOMMITID. For example, the above command would result in an archive in which the top-level directory is dockerize-169532eba46757aca8002e1c9bb257079a739f75, with files such as dockerize-169532eba46757aca8002e1c9bb257079a739f75/README.md inside it.

This gets you only the files in that particular commit; it does not fetch the .git directory or any repository history.
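For a CI job, the download and the unpacking can be combined into one step. The following is a minimal sketch, not a definitive implementation: the function names `archive_url` and `fetch_commit` are made up for illustration, and it assumes the `.tar.gz` variant of the archive URL (rather than `.zip`) so the result can be piped straight into `tar`.

```shell
#!/bin/sh
# Hypothetical helper: build the archive URL for a repository snapshot.
#   $1 = "OWNER/REPO", $2 = commit ID, branch, or tag
#   $3 = archive format, "zip" or "tar.gz" (defaults to tar.gz)
archive_url() {
    printf 'https://github.com/%s/archive/%s.%s\n' "$1" "$2" "${3:-tar.gz}"
}

# Hypothetical helper: download the snapshot and unpack it into $3,
# dropping the top-level REPO-LONGCOMMITID directory.
fetch_commit() {
    repo=$1 commit=$2 dest=$3
    mkdir -p "$dest" || return 1
    # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
    curl -fsSL "$(archive_url "$repo" "$commit")" |
        tar -xz --strip-components=1 -C "$dest"
}

# Example call (performs a network request):
# fetch_commit larsks/dockerize 169532e /tmp/code
```

Keeping the URL construction in its own function makes it easy to check what will be requested before any network access happens.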

Thanks that's perfect! For reference [here is the API documentation for that](https://developer.github.com/v3/repos/contents/#get-archive-link), and here is the full command I'm using: `curl -sS -L -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/repos/$REPOSITORY_NAME/tarball/$COMMIT_ID | tar --strip-components=1 -C /tmp/code -xz` (it works with private repositories). — Matthieu Napoli, Aug 12 '17 at 22:17
For public repositories it could be: `curl -sS -L https://api.github.com/repos/$REPOSITORY_NAME/tarball/$COMMIT_ID | tar --strip-components=1 -C /tmp/code -xz` — Matthieu Napoli, Aug 12 '17 at 22:17
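The two commands from the comments above can be folded into one helper that sends the `Authorization` header only when a token is available. This is a sketch under stated assumptions: the function names `api_tarball_url` and `fetch_commit_api` are hypothetical, and the token is assumed to live in the `GITHUB_TOKEN` environment variable as in the comment above.

```shell
#!/bin/sh
# Hypothetical helper: build the API tarball URL.
#   $1 = "OWNER/REPO", $2 = commit ID, branch, or tag
api_tarball_url() {
    printf 'https://api.github.com/repos/%s/tarball/%s\n' "$1" "$2"
}

# Hypothetical helper: download the tarball and unpack it into $3.
# Sends the token only if GITHUB_TOKEN is set (needed for private repositories).
fetch_commit_api() {
    repo=$1 commit=$2 dest=$3
    mkdir -p "$dest" || return 1
    if [ -n "${GITHUB_TOKEN:-}" ]; then
        curl -fsSL -H "Authorization: token $GITHUB_TOKEN" \
            "$(api_tarball_url "$repo" "$commit")"
    else
        curl -fsSL "$(api_tarball_url "$repo" "$commit")"
    fi | tar -xz --strip-components=1 -C "$dest"
}

# Example call (performs a network request):
# fetch_commit_api "$REPOSITORY_NAME" "$COMMIT_ID" /tmp/code
```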
