48

Sometimes, the original GitHub repository of a piece of software I'm using, such as linkchecker, is seeing little or no development, while a lot of forks have been created (in this case: 142, at the time of writing).

For each fork, I'd like to know:

  • which branches it has with commits ahead of the original master branch

and for each such branch:

  • how many commits it is ahead of the original
  • how many commits it is behind

GitHub has a web interface for comparing forks, but I don't want to do this manually for each fork; I just want a CSV file with the results for all forks. How can this be scripted? The GitHub API can list the forks, but I can't see how to compare forks with it. Cloning every fork in turn and doing the comparison locally seems a bit crude.

reinierpost
  • 2
    ++, but note that there is at least one issue with this approach... a fork can go very off-tangent from the original repo, in ways that may be good and/or bad, so knowing which fork has more commits isn't necessarily an indication of which is "ahead" of the original or not. – stevieb Feb 25 '19 at 18:01
  • 6
    I'm looking for a quick way to select the forks worth examining more closely. If you have a better idea, I'm all ears! – reinierpost Feb 26 '19 at 14:22
  • 2
    Related, probably a duplicate in fact: [Github, forked repositories ahead of master: active users](https://stackoverflow.com/q/47393854/3258851). – Marc.2377 Dec 19 '19 at 03:58
  • ! I didn't know about that feature. I don't think that question is a duplicate (I still want what I'm asking) but it definitely helps, thanks! – reinierpost Dec 19 '19 at 09:45
  • 1
    @reinierpost you might want to check this out: https://useful-forks.github.io/?repo=wummel/linkchecker – payne Feb 22 '22 at 05:07

8 Answers

76

After clicking "Insights" on top and then "Forks" on the left, the following bookmarklet prints the info (including links to ZIP files) directly onto the web page like this:

screenshot

Or like this if you click "Switch to tree view":

screenshot

The code to add as a bookmarklet (or to paste into the console):

javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const aTags = [...document.querySelectorAll('div.repo a:last-of-type')].slice(1).concat([...document.querySelectorAll('div.repository-content ul a:last-of-type:not(.Link--muted)')]);

  for (const aTag of aTags) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it directly onto the web page */
    await fetch(aTag.href)
      .then(x => x.text())
      .then(html => aTag.outerHTML += `${html.match(/This branch is.*/).pop().replace('This branch is', '').replace(/([0-9]+ commits? ahead)/, '<font color="#0c0">$1</font>').replace(/([0-9]+ commits? behind)/, '<font color="red">$1</font>')}` + " <a " + `${html.match(/href="[^"]*\.zip">/).pop() + "Download ZIP</a>"}`)
      .catch(console.error);
  }
})();

You can also paste the code into the address bar, but note that some browsers delete the leading javascript: while pasting, so you'll have to type javascript: yourself. Or copy everything except the leading j, type j, and paste the rest.

It has been modified from this answer.

root
26

useful-forks

useful-forks is an online tool that filters all the forks based on whether they are ahead of the original. I think it answers your needs quite well. :)

For the repo in your question, you could do: https://useful-forks.github.io/?repo=wummel/linkchecker

That should give you results similar to these (run on 2022-04-02): website

Also available as a Chrome Extension

Download it here: https://chrome.google.com/webstore/detail/useful-forks/aflbdmaojedofngiigjpnlabhginodbf

Useful button

And as a bookmarklet

Add this as the URL of a new bookmark, and click that bookmark when you're on a repo:

javascript:!function(){if(m=window.location.href.match(/github\.com\/([\w.-]+)\/([\w.-]+)/),m){window.open(`https://useful-forks.github.io/?repo=${m[1]}/${m[2]}`)}else window.alert("Not a GitHub repo")}();

Although to be honest, it's a better option to simply get the Chrome Extension, if you can.

Disclaimer

I am the maintainer of this project.

payne
  • Wonderful: just the output I need, CSV export option, fast, and it even prompts me for a personal access token and explains why, when it notices it's making too many calls. It still only gave me 30 entries though. I'll retry later. – reinierpost Feb 22 '22 at 09:37
  • 2
    @reinierpost The main limitation of the tool is precisely the quite underwhelming amount of calls allowed by GitHub's API for a given Access Token. Unfortunately, this means the tool cannot scan every single branch of each fork. Thus, the tool only compares the `master`/`main` branch. It could be interesting to change the strategy based on the amount of forks that exist for a given repository; I might look into implementing that in the future (or maybe a kind stranger will offer a PR). – payne Feb 23 '22 at 06:24
  • Unfortunately it works really bad. Instead of using simple HTTP calls it tries to use the extremely limited Github API. The much more upvoted Javascript solution from @root does not need any API. –  Jan 21 '23 at 11:57
  • @Ronny I just tried the bookmarklet version of the answer you are talking about, and I must firmly disagree. The other answer: (1) Uses GET requests, which are API calls just as much as what my tool does. (2) Is slower and does not present as much information. (3) Is not exhaustive (the Insight's page cuts short the long lists of forks). (4) Will potentially get your IP banned because it does not respect GitHub guidelines in terms of spamming requests. – payne Jan 22 '23 at 19:02
  • @payne I disagree with most of what you're saying. (1) The source code is available in the answer and everybody can see that it does not use the Github API. Just because it uses GET requests, it's not automatically an API. (2) Indeed, it's slower to not get your IP banned. It does one request after another just as you could do manually in your browser. (3) It's using the forks page which shows all forks in a long list, at least for the repos I've tested it. It has worked well for me with 250+ forks. (4) Typical FUD. If you do one request after another, they don't ban you. –  Jan 23 '23 at 04:18
  • @Ronny This is not the place for such debates, so I'll stop arguing. But nonetheless I would like to point you to an example repository: https://github.com/libgdx/libgdx/network/members (GitHub itself warns you it's not showing the entire list: "Woah, this network is huge! We’re showing only some of this network’s repositories.") – payne Jan 23 '23 at 06:22
10

Had exactly the same itch and wrote a scraper that takes the info printed in the rendered HTML for forks: https://github.com/hbbio/forkizard

Definitely not perfect, but a temporary solution.

Henri
  • 1
    GitHub still doesn't seem to show this info as far as I can tell, am I right? Eg. repo https://github.com/alormil/ipa-rest-api/network/members Or is there some other way currently to do this, otherwise your initiative sounds great, and I will def use it! – riper Nov 08 '20 at 21:44
3

Late to the party - I think this is the second time I've ended up on this SO post so I'll share my js-based solution (I ended up making a bookmarklet by just fetching and searching the html pages). You can either create a bookmarklet from this, or simply paste the whole thing into the console. Works on chromium-based and firefox:

EDIT: if there are more than 10 or so forks on the page, you may get locked out for scraping too fast (429 too many requests in network). Use async / await instead:

javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);

  for (const fork of forks) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it to console */
    await fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  }
})();

Or you can fetch in batches, but it's pretty easy to get locked out:

javascript:(async () => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);

  const getfork = (fork) => {
    return fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  }

  while (forks.length) {
    await Promise.all(forks.splice(0, 2).map(getfork));
  }
})();

Original (this fires all requests at once and will possibly lock you out if it is more requests/s than github allows)

javascript:(() => {
  /* while on the forks page, collect all the hrefs and pop off the first one (original repo) */
  const forks = [...document.querySelectorAll('div.repo a:last-of-type')].map(x => x.href).slice(1);

  for (const fork of forks) {
    /* fetch the forked repo as html, search for the "This branch is [n commits ahead,] [m commits behind]", print it to console */
    fetch(fork)
      .then(x => x.text())
      .then(html => console.log(`${fork}: ${html.match(/This branch is.*/).pop().replace('This branch is ', '')}`))
      .catch(console.error);
  }
})();

Will print something like:

https://github.com/user1/repo: 289 commits behind original:master.
https://github.com/user2/repo: 489 commits behind original:master.
https://github.com/user2/repo: 1 commit ahead, 501 commits behind original:master.
...

to console.
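
Since the question asks for CSV, console output like the above can be post-processed. Here is a minimal Python sketch, not part of the original bookmarklet: it assumes the lines look exactly like the sample output, and `STATUS_RE` and `status_lines_to_csv` are hypothetical names chosen for illustration.

```python
import csv
import io
import re

# Matches e.g. "https://github.com/u/r: 1 commit ahead, 501 commits behind original:master."
# Both the "ahead" and the "behind" part are optional.
STATUS_RE = re.compile(
    r"^(?P<url>\S+): "
    r"(?:(?P<ahead>\d+) commits? ahead)?"
    r"(?:, )?"
    r"(?:(?P<behind>\d+) commits? behind)?"
)

def status_lines_to_csv(lines):
    """Convert 'URL: N commits ahead, M commits behind base.' lines to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["fork", "ahead", "behind"])
    for line in lines:
        m = STATUS_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not start with "URL: "
        # A missing "ahead" or "behind" part is recorded as 0.
        writer.writerow([m["url"], int(m["ahead"] or 0), int(m["behind"] or 0)])
    return out.getvalue()
```

Copy the console lines into a text file and pass them to `status_lines_to_csv(open("forks.txt"))`; the file name here is just an example.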

EDIT: replaced comments with block comments for paste-ability

p-mcgowan
  • How is this supposed to work? In Firefox and Chrome, I can only paste it in as a single line, and it doesn't show any results when I click the resulting bookmark on [the forks page for rclone](https://github.com/rclone/rclone/network/members). However, when I then reload that page in Chrome, the page shows "Access has been restricted - You have triggered an abuse detection mechanism." so *something* must have happened. – reinierpost Apr 10 '21 at 14:38
  • Ah yeah, there are many forks on that one. I assume you just got slapped with a rate limit. I'll update the description with a throttled version - the one I tested with only had about 20 forks – p-mcgowan Apr 14 '21 at 17:18
  • @reinierpost also note - as a bookmarklet, it just makes it easy to click - you still need the console open to see the results of the fetch. what happened was you got rate limited, and since a reload just makes another request, that too was blocked. I've updated the description to scrape them 1 by 1 instead – p-mcgowan Apr 14 '21 at 17:31
  • Mmmm, `console.log`, I should have checked that. It's working now! My Firefox console now has a long list or lines such as `https://github.com/adragomir/rclone: 1 commit ahead, 4251 commits behind rclone:master. ).pop().replace('This branch is ', '')}`)) .catch(console.error); } })();:1:442` – reinierpost Apr 14 '21 at 17:47
  • i think I did a ninja edit some time in between - I wrapped the regex in a block comment and it did not work, but I've changed it and it should work now – p-mcgowan Apr 14 '21 at 20:26
  • It doesn't help, even when I remove all block comments and replace the `.*/` before the ).pop with `.+/`. I'm using current Firefox (87.0) on Ubuntu 18.04. – reinierpost Apr 14 '21 at 20:56
  • 1
    ah, I see the issue. On firefox, when using a marklet, the console prints the line number out with the script. with dark theme, on the left it's grey text (the console output) and on the right it's blue text (the line number where the console log is). if you paste it into the console rather than as a marklet it will show "debugger eval code" instead - it's working, it just looks funny – p-mcgowan Apr 16 '21 at 06:40
2

active-forks doesn't quite do what I want, but it comes close and is very easy to use.

reinierpost
1

Here's a Python script using the GitHub API. I wanted to include the date and last commit message. You'll need to include a Personal Access Token (PAT) if you need a bump to 5,000 requests per hour.

USAGE: python3 list-forks.py https://github.com/itinance/react-native-fs

Example Output:

https://github.com/itinance/react-native-fs root 2021-11-04 "Merge pull request #1016 from mjgallag/make-react-native-windows-peer-dependency-optional  make react-native-windows peer dependency optional"
https://github.com/AnimoApps/react-native-fs diverged +2 -160 [+1m 10d] "Improved comments to align with new PNG support in copyAssetsFileIOS"
https://github.com/twinedo/react-native-fs ahead +1 [+26d] "clear warn yellow new NativeEventEmitter()"
https://github.com/synonymdev/react-native-fs ahead +2 [+23d] "Merge pull request #1 from synonymdev/event-emitter-fix  Event Emitter Fix"
https://github.com/kongyes/react-native-fs ahead +2 [+10d] "aa"
https://github.com/kamiky/react-native-fs diverged +1 -2 [-6d] "add copyCurrentAssetsVideoIOS function to retrieve current modified videos"
https://github.com/nikola166/react-native-fs diverged +1 -2 [-7d] "version"
https://github.com/morph3ux/react-native-fs diverged +1 -4 [-30d] "Update package.json"
https://github.com/broganm/react-native-fs diverged +2 -4 [-1m 7d] "Update RNFSManager.m"
https://github.com/k1mmm/react-native-fs diverged +1 -4 [-1m 14d] "Invalidate upload session  Prevent memory leaks"
https://github.com/TickKleiner/react-native-fs diverged +1 -4 [-1m 24d] "addListener and removeListeners methods wass added to pass warning"
https://github.com/nerdyfactory/react-native-fs diverged +1 -8 [-2m 14d] "fix: applying change from https://github.com/itinance/react-native-fs/pull/944"

import requests, re, os, sys, time, json, datetime
from dateutil.relativedelta import relativedelta
from urllib.parse import urlparse

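# Read the Personal Access Token from the environment instead of hard-coding it
# (the variable name GITHUB_PAT is a suggestion; use whatever suits your setup).
GITHUB_PAT = os.environ.get('GITHUB_PAT', '')

def json_from_url(url):
    # Send the Authorization header only when a token is configured.
    headers = {'Authorization': 'token {}'.format(GITHUB_PAT)} if GITHUB_PAT else {}
    response = requests.get(url, headers=headers)
    return response.json()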

def date_delta_to_text(date1, date2) -> str:
    ret = []
    date_delta = relativedelta(date2, date1)
    sign = '+' if date1 < date2 else '-'

    if date_delta.years != 0:
        ret.append('{}y'.format(abs(date_delta.years)))

    if date_delta.months != 0:
        ret.append('{}m'.format(abs(date_delta.months)))

    if date_delta.days != 0:
        ret.append('{}d'.format(abs(date_delta.days)))
    else:
        sign = ''
        ret.append('0d')

    return '{}{}'.format(sign, ' '.join(ret))

def iso8601_date_to_date(date):
    return datetime.datetime.strptime(date, '%Y-%m-%dT%H:%M:%SZ')

def date_to_text(date):
    return date.strftime('%Y-%m-%d')

def process_repo(repo_author, repo_name, branch_name, fork_of_fork):
    page = 1

    while 1:
        forks_url = 'https://api.github.com/repos/{}/{}/forks?per_page=100&page={}'.format(repo_author, repo_name, page)
        forks_json = json_from_url(forks_url)

        if not forks_json:
            break

        for fork_info in forks_json:
            fork_author = fork_info['owner']['login']
            fork_name = fork_info['name']
            forks_count = fork_info['forks_count']
            fork_url = 'https://github.com/{}/{}'.format(fork_author, fork_name)

            # Compare within the parent repo (repo_name), since a fork may have been renamed.
            compare_url = 'https://api.github.com/repos/{}/{}/compare/{}...{}:{}'.format(repo_author, repo_name, branch_name, fork_author, branch_name)
            compare_json = json_from_url(compare_url)

            if 'status' in compare_json:
                items = []

                status = compare_json['status']
                ahead_by = compare_json['ahead_by']
                behind_by = compare_json['behind_by']
                total_commits = compare_json['total_commits']
                commits = compare_json['commits']

                if fork_of_fork:
                    items.append('   ')

                items.append(fork_url)
                items.append(status)

                if ahead_by != 0:
                    items.append('+{}'.format(ahead_by))

                if behind_by != 0:
                    items.append('-{}'.format(behind_by))

                if commits:
                    # The API may return fewer commits than total_commits, so take the last one returned.
                    last_commit = commits[-1]
                    commit = last_commit['commit']
                    author = commit['author']
                    date = iso8601_date_to_date(author['date'])
                    items.append('[{}]'.format(date_delta_to_text(root_date, date)))
                    items.append('"{}"'.format(commit['message'].replace('\n', ' ')))

                if ahead_by > 0:
                    print(' '.join(items))

            if forks_count > 0:
                process_repo(fork_author, fork_name, branch_name, True)

        page += 1


def get_commits_json(root_author, root_name, branch_name):
    commits_url = 'https://api.github.com/repos/{}/{}/commits/{}'.format(root_author, root_name, branch_name)
    return json_from_url(commits_url)

url_parsed = urlparse(sys.argv[1].strip())
path_array = url_parsed.path.split('/')
root_author = path_array[1]
root_name = path_array[2]
branch_name = 'master'

root_url = 'https://github.com/{}/{}'.format(root_author, root_name)
commits_json = get_commits_json(root_author, root_name, branch_name)

# A successful response has no top-level 'message' key, so use .get() to avoid a KeyError.
if commits_json.get('message') == 'No commit found for SHA: master':
    branch_name = 'main'
    commits_json = get_commits_json(root_author, root_name, branch_name)
commit = commits_json['commit']
author = commit['author']
root_date = iso8601_date_to_date(author['date'])
print('{} root {} "{}"'.format(root_url, date_to_text(root_date), commit['message'].replace('\n', ' ')))

process_repo(root_author, root_name, branch_name, False)
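
The question also asks about every branch of each fork, not only the default one. A possible extension, which is not part of the script above, is sketched below using only the standard library. It relies on the GitHub REST endpoints for listing forks, listing branches and comparing commits; the helper names (`get_json`, `paged`, `branch_rows`) are made up for illustration, requests are unauthenticated, and one compare call is made per branch, so the rate limit will be reached quickly on repositories with many forks.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.github.com"

def get_json(url):
    """Fetch one API URL without authentication; raises HTTPError on failure."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def paged(url, fetch):
    """Yield items from a paginated list endpoint until an empty page is returned."""
    page = 1
    while True:
        items = fetch("{}?per_page=100&page={}".format(url, page))
        if not items:
            return
        yield from items
        page += 1

def branch_rows(owner, repo, base_branch, fetch=get_json):
    """Yield (fork, branch, ahead, behind) for each fork branch ahead of base_branch."""
    for fork in paged("{}/repos/{}/{}/forks".format(API, owner, repo), fetch):
        fork_owner, fork_name = fork["owner"]["login"], fork["name"]
        branches_url = "{}/repos/{}/{}/branches".format(API, fork_owner, fork_name)
        for branch in paged(branches_url, fetch):
            head = "{}:{}".format(fork_owner, urllib.parse.quote(branch["name"]))
            comparison = fetch("{}/repos/{}/{}/compare/{}...{}".format(
                API, owner, repo, base_branch, head))
            if comparison.get("ahead_by", 0) > 0:
                yield ("{}/{}".format(fork_owner, fork_name), branch["name"],
                       comparison["ahead_by"], comparison["behind_by"])
```

The rows can be written with `csv.writer(sys.stdout).writerows(branch_rows("wummel", "linkchecker", "master"))`. Passing `fetch` as a parameter makes it easy to swap in an authenticated or throttled fetcher.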
headkaze
0

Here's a Python script for listing and cloning all forks that are ahead.

It doesn't use the API, so it isn't subject to the API rate limit and doesn't require authentication. However, it might need adjustments if the GitHub website design changes.

Unlike the bookmarklet in the other answer that shows links to ZIP files, this script also saves info about the commits because it uses git clone and also creates a commits.htm file with the overview.

import requests, re, os, sys, time

def content_from_url(url):
    # TODO handle internet being off and stuff
    text = requests.get(url).content
    return text

ENCODING = "utf-8"

def clone_ahead_forks(forklist_url):
    forklist_htm = content_from_url(forklist_url).decode(ENCODING)
    with open("forklist.htm", "w", encoding=ENCODING) as text_file:
        text_file.write(forklist_htm)
        
    is_root = True
    # not working if there are no forks: '<a class="(Link--secondary)?" href="(/([^/"]*)/[^/"]*)">'
    for match in re.finditer('<a (class=""|data-pjax="#js-repo-pjax-container") href="(/([^/"]*)/[^/"]*)">', forklist_htm):
        fork_url = 'https://github.com'+match.group(2)
        fork_owner_login = match.group(3)
        fork_htm = content_from_url(fork_url).decode(ENCODING)
        
        match2 = re.search('([0-9]+ commits? ahead(, [0-9]+ commits? behind)?)', fork_htm)
        # TODO check whether 'ahead'/'behind'/'even with' appear only once on the entire page - in that case they are not part of the readme, "About" box, etc.
        
        sys.stdout.write('.')
        if match2 or is_root:
            if match2:
                aheadness = match2.group(1) # for example '1 commit ahead, 2 commits behind'
            else:
                aheadness = 'root repo'
                is_root = False # for subsequent iterations
                
            dir = fork_owner_login+' ('+aheadness+')'
            print(dir)
            
            if not os.path.exists(dir):
                os.mkdir(dir)
                os.chdir(dir)
                
                # save commits.htm
                commits_htm = content_from_url(fork_url+'/commits').decode(ENCODING)
                with open("commits.htm", "w", encoding=ENCODING) as text_file:
                    text_file.write(commits_htm)
                
                # git clone
                os.system('git clone '+fork_url+'.git')
                print()
                
                # no need to recurse into forks of forks because they are all listed on the initial page and being traversed already
                    
                os.chdir('..')
            else:
                print(dir+' already exists, skipping.')

base_path = os.getcwd()
match_disk_letter = re.search(r'^([a-zA-Z]:\\)', base_path)

with open('repo_urls.txt') as url_file:
    for url in url_file:
        url = url.strip()
        url = re.sub(r'\?[^/]*$', '', url) # remove strings like '?utm_source=...' from the end
        print(url)
        match = re.search(r'github\.com/([^/]*)/([^/]*)$', url)
        if match:
            user_name = match.group(1)
            repo_name = match.group(2)
            print(repo_name)
            dirname_for_forks = repo_name+' ('+user_name+')'
            if not os.path.exists(dirname_for_forks):
                url += "/network/members" # page that lists the forks

                TMP_DIR = 'tmp_'+time.strftime("%Y%m%d-%H%M%S")
                if match_disk_letter: # if Windows, i.e. if path starts with A:\ or so, run git in A:\tmp_... instead of .\tmp_..., in order to prevent "filename too long" errors
                    TMP_DIR = match_disk_letter.group(1)+TMP_DIR
                print(TMP_DIR)

                os.mkdir(TMP_DIR)
                os.chdir(TMP_DIR)
                clone_ahead_forks(url)
                print()
                os.chdir(base_path)
                os.rename(TMP_DIR, dirname_for_forks)
            else:
                print(dirname_for_forks+' ALREADY EXISTS, SKIPPING.')
        
print('DONE.')

If you make the file repo_urls.txt with the following content (you can put several URLs, one URL per line):

https://github.com/cifkao/tonnetz-viz

then you'll get the following directories each of which contains the respective cloned repo:

tonnetz-viz (cifkao)
  bakaiadam (2 commits ahead)
  chumo (2 commits ahead, 4 commits behind)
  cifkao (root repo)
  codedot (76 commits ahead, 27 commits behind)
  k-hatano (41 commits ahead)
  shimafuri (11 commits ahead, 8 commits behind)

If it doesn't work, try earlier versions.

root
  • I guess we should add the `--mirror` flag to `git clone` as described [here](https://stackoverflow.com/a/3960063/5231110), right? – root Sep 20 '21 at 15:47
  • Apparently if we had added the `--mirror` flag to `git clone` then the files in the repo wouldn't get downloaded... I don't understand... – root Jun 04 '22 at 18:59
0

Here's a Python script for listing and cloning the forks that are ahead. This script partially uses the API, so it can hit the rate limit. You can raise the limit (though not indefinitely) by adding GitHub API authentication to the script; please edit or post that if you do.

Initially I tried to use the API entirely, but that triggered the rate limit too fast, so now I use is_fork_ahead_HTML instead of is_fork_ahead_API. This might require adjustments if the GitHub website design changes.

Due to the rate limit, I prefer the other answers that I posted here.

import requests, json, os, re

def obj_from_json_from_url(url):
    # TODO handle internet being off and stuff
    text = requests.get(url).content
    obj = json.loads(text)
    return obj, text

def is_fork_ahead_API(fork, default_branch_of_parent):
    """ Use the GitHub API to check whether `fork` is ahead.
     This triggers the rate limit, so prefer the non-API version below instead.
    """
    # Compare default branch of original repo with default branch of fork.
    comparison, comparison_json = obj_from_json_from_url('https://api.github.com/repos/'+user+'/'+repo+'/compare/'+default_branch_of_parent+'...'+fork['owner']['login']+':'+fork['default_branch'])
    if comparison['ahead_by']>0:
        return comparison_json
    else:
        return False

def is_fork_ahead_HTML(fork):
    """ Use the GitHub website to check whether `fork` is ahead.
    """
    htm = requests.get(fork['html_url']).text  # decoded text, so the str regex below also works on Python 3
    match = re.search('<div class="d-flex flex-auto">[^<]*?([0-9]+ commits? ahead(, [0-9]+ commits? behind)?)', htm)
    # TODO if website design changes, fallback onto checking whether 'ahead'/'behind'/'even with' appear only once on the entire page - in that case they are not part of the username etc.
    if match:
        return match.group(1) # for example '1 commit ahead, 114 commits behind'
    else:
        return False

def clone_ahead_forks(user,repo):
    obj, _ = obj_from_json_from_url('https://api.github.com/repos/'+user+'/'+repo)
    default_branch_of_parent = obj["default_branch"]
    
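    page = 0
    while True:
        page += 1
        forks, _ = obj_from_json_from_url('https://api.github.com/repos/'+user+'/'+repo+'/forks?per_page=100&page='+str(page))
        # Stop on an empty page (end of the list) or on an error object (e.g. rate limit reached).
        if not isinstance(forks, list) or not forks:
            break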

        for fork in forks:
            aheadness = is_fork_ahead_HTML(fork)
            if aheadness:
                #dir = fork['owner']['login']+' ('+str(comparison['ahead_by'])+' commits ahead, '+str(comparison['behind_by'])+'commits behind)'
                dir = fork['owner']['login']+' ('+aheadness+')'
                print(dir)
                os.mkdir(dir)
                os.chdir(dir)
                os.system('git clone '+fork['clone_url'])
                print()
                
                # recurse into forks of forks
                if fork['forks_count']>0:
                    clone_ahead_forks(fork['owner']['login'], fork['name'])
                    
                os.chdir('..')

user = 'cifkao'
repo = 'tonnetz-viz'

clone_ahead_forks(user,repo)
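
On the rate-limit problem mentioned above: GitHub API responses carry `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers (the latter is a UTC epoch timestamp in seconds), so a script can pause instead of failing. The helper below is a small sketch of that idea and is not part of the script above; `seconds_until_reset` is a hypothetical name, and it expects a mapping with exactly these header spellings (the `headers` object of a `requests` response is case-insensitive, a plain dict is not).

```python
import time

def seconds_until_reset(headers, now=None):
    """Return how long to wait before the next API call, based on rate-limit headers.

    Returns 0 when requests remain or when the headers are absent.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = int(headers.get("X-RateLimit-Reset", now))
    # Never return a negative wait if the reset time has already passed.
    return max(0.0, reset_at - now)
```

A caller could run `time.sleep(seconds_until_reset(response.headers))` before each request to stay within the limit.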
root