26

I need to run a very large search on GitHub to gather statistics for my thesis.

For example, I need to explore a large number of Android projects on GitHub, but the site limits search results to 1000 (e.g. https://github.com/search?l=java&q=onCreate&ref=searchresults&type=Code&utf8=%E2%9C%93). I also tried the Java GitHub API, using the library org.eclipse.egit.github.core.client.GitHubClient and the method GitHubClient.searchRepositories(), but the number of results is limited there as well.

Does anyone know how to get all results?

bitoiu
scott
  • Have you looked at the [GitHub Archive](https://www.githubarchive.org/)? It could be a way to get your data without having to bother the live GitHub search API, which as you found out gives a limited number of results, and is also rate-limited. – Wander Nauta Jun 02 '16 at 22:12
  • Are you able to page through the results? You could get the first chunk of 1000, get the next chunk, and repeat until you have it all. – Kyle Falconer Jun 02 '16 at 22:19
  • This is not a Java question, or even a programming question. – shmosel Jun 02 '16 at 22:23
  • Correct, you're limited to 1000 results per search & 30 requests per minute: https://developer.github.com/v3/search/#about-the-search-api – zapl Jun 03 '16 at 00:33
  • Is your code publicly available? – Soubriquet Jan 13 '17 at 18:48
  • For latecomers' information: the limitation of 1000 results is lifted since I could not find "1000" in the link provided by "zapl", and the GitHub query can easily go to page 11. – Peipei Dec 01 '20 at 04:27
  • @Peipei The limit of 1000 still holds, unfortunately. Here is the link saying that - https://docs.github.com/en/rest/search?apiVersion=2022-11-28 – desert_ranger Dec 22 '22 at 17:27

2 Answers

39

The Search API will return up to 1000 results per query (including pagination), as documented here:

https://developer.github.com/v3/search/#about-the-search-api

However, there's a neat trick you can use to fetch more than 1000 results when executing a repository search: split the search into segments by the date the repositories were created. For example, you could first search for repositories created in the first week of October 2013, then the second week, then September, and so on.

Because each search is restricted to a narrow period, it will probably return fewer than 1000 results, so you can retrieve all of them. If a period still returns more than 1000 results, narrow it further until every segment fits under the limit.

https://help.github.com/articles/searching-repositories/#search-based-on-when-a-repository-was-created-or-last-updated

You should be able to automate this via the API.
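As a rough illustration of automating this, here is a minimal Python sketch of the date-splitting idea. It is not an official client: the function names (`build_url`, `split_windows`, `count_via_api`) are hypothetical, the halving strategy is just one way to narrow a period, and it assumes the `created:START..END` range qualifier and the `total_count` field of the search response. Rate limiting and authentication are left out.

```python
import json
import urllib.request
from datetime import date, timedelta
from typing import Callable, List, Tuple
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"
RESULT_CAP = 1000  # the Search API returns at most 1000 results per query


def build_url(base_query: str, start: date, end: date) -> str:
    """Build a search URL limited to repositories created between start and end."""
    q = f"{base_query} created:{start.isoformat()}..{end.isoformat()}"
    return SEARCH_URL + "?" + urlencode({"q": q, "per_page": 100})


def split_windows(
    start: date, end: date, count_results: Callable[[date, date], int]
) -> List[Tuple[date, date]]:
    """Halve [start, end] recursively until each window fits under the cap.

    A single day that still exceeds the cap is returned as-is; it would need
    a finer split (for example by timestamp), which this sketch omits.
    """
    if start == end or count_results(start, end) <= RESULT_CAP:
        return [(start, end)]
    mid = start + timedelta(days=(end - start).days // 2)
    left = split_windows(start, mid, count_results)
    right = split_windows(mid + timedelta(days=1), end, count_results)
    return left + right


def count_via_api(base_query: str) -> Callable[[date, date], int]:
    """Return a counter that asks the live API for total_count (unauthenticated)."""
    def count(start: date, end: date) -> int:
        req = urllib.request.Request(
            build_url(base_query, start, end),
            headers={"Accept": "application/vnd.github+json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["total_count"]
    return count
```

Calling `split_windows(date(2013, 10, 1), date(2013, 10, 31), count_via_api("language:Java"))` would return a list of date windows; each window can then be paged through separately with `build_url`. Every count costs one search request, so in practice you would add a delay between calls to stay within the rate limit.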

Ivan Zuzak
  • Seems like you can't query the repository search api by date created. The following will search, but sort, order, and created are ignored: `curl -H 'Accept: application/vnd.github.v3.text-match+json' 'https://api.github.com/search/repositories?q=language:Java&created>=2013-04-11T00:00:00Z&sort=created&order=asc' | grep created_at` – Soubriquet Jan 15 '17 at 23:09
  • @Soubriquet You're not constructing that URL correctly. The "created" parameter should be a part of the query, not a parameter on its own. – Ivan Zuzak Jan 16 '17 at 10:06
  • Also, you can't sort by created -- the fields you can sort by are listed here: https://developer.github.com/v3/search/#parameters – Ivan Zuzak Jan 16 '17 at 10:07
  • Awesome! Thanks! In case anyone else needs this: `https://api.github.com/search/repositories?q=language:Java+created:>=2013-04-11T00:00:00Z&order=asc` – Soubriquet Jan 16 '17 at 13:52
  • The original query was for code search though, i.e., contents of files rather than repositories. Unfortunately it seems that creation dates aren't available when searching for files... – s.d Jun 15 '17 at 22:22
  • The `order=asc` applies on the `sort` field which can be stars, forks, updated or best_match(default). So `curl -G https://api.github.com/search/repositories --data-urlencode "q=created:>2013-04-11" --data-urlencode "order=asc"` gets all repositories created after 2013-04-11 but not in the created order. We can fetch repositories within a range using `q=created:time1..time2`, but the results are not sorted by created time. – Alex Mar 20 '18 at 11:49
  • OP is trying to search code, not repositories. You can't sort by any kind of date when searching code. – Bernard Jan 13 '23 at 15:51
8

If you are searching for all files on GitHub with filename:your-file-name, you can also slice the search with another query qualifier: size.

For example, suppose you are looking for all files named test.rb on GitHub. The GitHub API may report more than 11M results, but you can only retrieve 1000 of them, because the GitHub Search API provides up to 1,000 results for each search. A URL like https://api.github.com/search/code?q=filename:test.rb+size:1000..1500 lets you slice the search by varying the size range.
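A small Python sketch of how the size slices could be generated follows. The helper names (`size_ranges`, `code_search_url`) and the fixed step are assumptions made for illustration, and the sketch treats `size:low..high` as an inclusive range in bytes. It only builds the URLs; sending the requests (code search generally needs an authenticated request) is left out.

```python
from typing import Iterator, Tuple
from urllib.parse import urlencode

CODE_SEARCH_URL = "https://api.github.com/search/code"


def size_ranges(max_size: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield non-overlapping (low, high) ranges that together cover 0..max_size."""
    low = 0
    while low <= max_size:
        high = min(low + step - 1, max_size)
        yield low, high
        low = high + 1


def code_search_url(filename: str, low: int, high: int) -> str:
    """Build a code-search URL for one file name restricted to one size slice."""
    q = f"filename:{filename} size:{low}..{high}"
    return CODE_SEARCH_URL + "?" + urlencode({"q": q, "per_page": 100})


# Example: URLs for test.rb files up to 1200 bytes, in 500-byte slices.
urls = [code_search_url("test.rb", lo, hi) for lo, hi in size_ranges(1200, 500)]
```

If a slice still reports more than 1000 results, it can be narrowed again with a smaller step. Files larger than `max_size` would need one more query with an open-ended range.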

Frank Yang