Efficient retrieval of releases that contain a commit

Question

In the command line, if I type

git tag --contains {commit}

to obtain a list of releases that contain a given commit, it takes around 11 to 20 seconds for each commit. Since the target code base has more than 300,000 commits, retrieving this information for every commit that way would take well over a month of run time.

However, gitk apparently manages to retrieve this data quickly. From what I could find, it uses a cache for that purpose.

I have two questions:

How can I interpret that cache format?
Is there a way to obtain a dump from the git command line tool to generate that same information?

Would implementing your own cli cache functionality work for you? If so, I think I can come up with some ideas for that. — Alexander Bird, Jun 05 '12 at 00:38

jthill · Accepted Answer · 2012-06-05T20:17:49.790

You can get this almost directly from git rev-list.

latest.awk:

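# Input: "commit <sha> <child>..." lines from rev-list --children, each
# optionally followed by a %d decoration line such as " (tag: v1.0, master)".
BEGIN { thiscommit=""; }
$1 == "commit" {
    # Flush the previous commit now that its decoration line (if any) was read.
    if ( thiscommit != "" )
        print thiscommit, tags[thiscommit]
    thiscommit=$2
    line[$2]=NR
    latest = 0;
    # Inherit the refs of the child that appeared most recently in the listing.
    for ( i = 3 ; i <= NF ; ++i ) if ( line[$i] > latest ) {
        latest = line[$i];
        tags[$2] = tags[$i];
    }
    next;
}
# A decoration line: the commit's own refs override anything inherited.
$1 != "commit"  { tags[thiscommit] = $0; }
END { if ( thiscommit != "" ) print thiscommit, tags[thiscommit]; }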

A sample command:

git rev-list --date-order --children --format=%d --all | awk -f latest.awk

You can also use --topo-order instead of --date-order, and you'll probably have to weed out unwanted refs in the $1 != "commit" rule, because %d lists every ref (branches included), not just tags.
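One way to weed out the non-tag refs is sketched below as a tags-only variant of latest.awk, wrapped in a shell function. The name nearest_tags is made up for illustration, and the sketch assumes that %d marks tags as "tag: <name>" and separates refs with ", ", which is what current git versions print; older versions may differ. Unlike latest.awk, it also skips children that carry no tag when inheriting.

```shell
# Sketch only: tags-only variant of latest.awk (nearest_tags is a hypothetical name).
# Reads the output of
#   git rev-list --date-order --children --format=%d --all
# on stdin and prints "<commit> <tag>[,<tag>...]" for every commit.
nearest_tags() {
    awk '
    $1 == "commit" {
        if (thiscommit != "") print thiscommit, tags[thiscommit]
        thiscommit = $2
        line[$2] = NR
        latest = 0
        # Inherit from the most recently listed child that carries a tag.
        for (i = 3; i <= NF; ++i) if (line[$i] > latest && tags[$i] != "") {
            latest = line[$i]
            tags[$2] = tags[$i]
        }
        next
    }
    {
        # Decoration line, e.g. " (HEAD -> master, tag: v1.0, origin/master)".
        s = $0
        sub(/^ *\(/, "", s)
        sub(/\) *$/, "", s)
        n = split(s, ref, ", ")
        own = ""
        # Keep only the entries that start with "tag: " (assumed format).
        for (k = 1; k <= n; ++k)
            if (index(ref[k], "tag: ") == 1)
                own = own (own == "" ? "" : ",") substr(ref[k], 6)
        # Branch-only decorations leave the inherited tags untouched.
        if (own != "") tags[thiscommit] = own
    }
    END { if (thiscommit != "") print thiscommit, tags[thiscommit] }
    '
}
```

Usage would be `git rev-list --date-order --children --format=%d --all | nearest_tags`. Commits with no tagged descendant come out with an empty tag field. Like latest.awk, this follows a single child per commit, so it reports a nearest tag rather than the full --contains list.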

Depending on what kind of transitivity you want and how explicit the listing has to be, accumulating the tags might need a dictionary. Here's one that gets an explicit listing of all refs for all commits:

all.awk:

BEGIN {
    thiscommit="";
}
$1 == "commit" {
    if ( thiscommit != "" )
        print thiscommit, tags[thiscommit]
    thiscommit=$2
    line[$2]=NR
    # Reset the per-commit set used to drop duplicate refs.
    split("",seen);
    # Merge the refs of every child, keeping each ref once.
    for ( i = 3 ; i <= NF ; ++i ) {
        nnew=split(tags[$i],new);
        for ( n = 1 ; n <= nnew ; ++n ) {
            if ( !seen[new[n]] ) {
                tags[$2]= tags[$2]" "new[n]
                seen[new[n]] = 1
            }
        }
    }
    next;
}
# Decoration line " (ref1, ref2, ...)": strip the leading " (" and the
# trailing ")" and append this commit's own refs.
$1 != "commit"  {
    nnew=split($0,new,", ");
    new[1]=substr(new[1],3);
    new[nnew]=substr(new[nnew],1,length(new[nnew])-1);
    for ( n = 1; n <= nnew ; ++n )
        tags[thiscommit] = tags[thiscommit]" "new[n]
}
END { if ( thiscommit != "" ) print thiscommit, tags[thiscommit]; }

all.awk took a few minutes to process the 322K commits of the Linux kernel repo, roughly a thousand commits a second (there are lots of duplicate strings and redundant processing), so you'd probably want to rewrite it in C++ if you're really after the complete cross-product. But I don't think gitk shows that, only the nearest neighbors, right?
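Since the batch run is the expensive part, one option is to save its output once and answer per-commit questions from that file. A minimal sketch, assuming the listing was written with something like `... | awk -f all.awk > contains.txt`; both the file name contains.txt and the function name refs_for_commit are made up for illustration:

```shell
# Sketch only: look up one commit in a saved "<sha> <ref> <ref> ..." listing.
# $1 = listing file (e.g. the hypothetical contains.txt), $2 = full commit hash.
refs_for_commit() {
    awk -v sha="$2" '
    $1 == sha {
        # Drop the hash and the separator after it, print the remaining refs.
        sub(/^[^ ]+ */, "")
        print
        exit
    }' "$1"
}
```

For example, `refs_for_commit contains.txt <sha>` prints the refs recorded for that commit, or nothing if the hash is not in the file. The listing has to be regenerated whenever refs or history change.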

So to clarify for us non-awk users: these scripts do the exact same thing as `git tag --contains {commit}`? — Alexander Bird, Jun 08 '12 at 01:32
rev-list's %d shows all refs, not just tags, so all.awk gets branches as well as tags, but other than that yeah, all.awk is a batch --contains ... btw it won't take two hours to learn awk. — jthill, Jun 08 '12 at 17:33
