(First, big thanks for the reproducer—it was helpful—but one note: watch out, your quotes got mangled into "smart quotes" instead of plain double quotes. I fixed them.)
I would like to show only changes to the column headers of a csv file tracked by git.
Based on the example, by "column headers" I take it you mean "line 1".
The basic problem starts here:
git log --format=format:%H $FILE | ...
This finds, and prints the hash ID of, each commit that changes anything in the given file. (FILE needs to be set to table.csv here.) This is not at all what you want! Its only function is to skip any commit in which the file is entirely unchanged. That could be useful in real-world examples, but not so much in your reproducer, since every commit changes the file here.
(Side note: whenever it's possible, use git rev-list instead of git log. It's possible here. However, we're going to end up discarding git log / git rev-list anyway. But see the separate section below.)
... | xargs -L 1 git blame $FILE -L $LINE,$LINE
(Here, LINE needs to be set to 1.) The general idea here seems to be to run git blame on one specific line (in this case line 1), which is fine as far as it goes, but isn't really what we want. If our left-side command, git log ... $FILE, had selected just the revisions we want, those would already be the revisions we want and we could just stop here.
The real trick here is to run git blame repeatedly but only until the blame "runs out". Each invocation of git blame should tell us who / which commit is "responsible for" (i.e., produced this version of) the given line, and that's exactly what git blame does. You give it a starting revision (or ending revision: Git works backwards, so we start at the end and work backwards), and Git checks that version and the previous commit to see if the line in question changed in that version. If so, we're done: we print that version and the line. If not, we put the previous version in place and repeat. We do this until we run out of "previous versions", in which case we just print this version and stop.
So git blame is already doing what you want. The only problem is that it stops after it finds the "previous version" to print. So what we really want is to build a loop:
do {
    rev, other-info, output = <what git blame does>
    print rev and/or output in appropriate format
} while other-info says there are previous revs
The way to deal with this is to use --porcelain (or --incremental, but --porcelain seems most appropriate here). We know that -L 1,1 (or -L $LINE,$LINE) is going to output a single line of file content at the end. We want to collect the remaining lines. The output from --porcelain is described in the documentation: it's a series of lines of which, in our case, the first and last are of interest. The middle ones might or might not be interesting, except that previous or boundary is always of interest.
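To make the shape of that output concrete, here is a minimal sketch, in Python purely for illustration, that splits one porcelain record into its three parts. The SAMPLE text is hand-written to resemble the documented format (the hash IDs and author are made up), and parse_porcelain is a hypothetical helper name, not anything Git provides.

```python
from typing import Dict, Optional, Tuple

# Hand-written sample resembling `git blame --porcelain -L1,1` output.
# The hash IDs, author, and summary are invented for illustration.
SAMPLE = (
    "1111111111111111111111111111111111111111 1 1 1\n"
    "author A U Thor\n"
    "summary rename a column\n"
    "previous 2222222222222222222222222222222222222222 table.csv\n"
    "filename table.csv\n"
    "\tid,name,price\n"
)


def parse_porcelain(text: str) -> Tuple[str, Dict[str, Optional[str]], Optional[str]]:
    """Split one porcelain record into (hash, keyword lines, content line)."""
    lines = text.rstrip("\n").split("\n")
    # First line: "<hash> <orig-line> <final-line> <count>"; keep the hash.
    rev = lines[0].split()[0]
    kws: Dict[str, Optional[str]] = {}
    content: Optional[str] = None
    for line in lines[1:]:
        if line.startswith("\t"):
            # The file content line is prefixed with a single TAB.
            content = line[1:]
        else:
            # "keyword value", or a bare keyword such as "boundary".
            key, _, value = line.partition(" ")
            kws[key] = value or None
    return rev, kws, content


if __name__ == "__main__":
    rev, kws, content = parse_porcelain(SAMPLE)
    print(rev, kws.get("previous"), content)
```

If "previous" is present, its value holds the hash ID and path to feed into the next git blame; if "boundary" is present instead, there is nothing further back to look at.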
Shell parsing is kind of messy, so it's probably best to use some other language to handle the output from git blame. For instance, we might use a small Python program. This one doesn't have many features but shows how to use --porcelain here, and should be easy to modify. It has been very lightly tested (and run through black for formatting and mypy for type checking), but definitely needs better error handling. For instance, running it with a nonexistent pathname gets you a fatal error message, followed by a Python traceback. I leave the cleanup to someone else, at this point.
#! /usr/bin/env python3

"""
Analyze "git blame" output and repeat until we reach the boundary.
"""

import argparse
import subprocess
import sys


def blame(path: str, args: argparse.Namespace) -> None:
    rev = "HEAD"
    while True:
        cmd = [
            "git",
            "blame",
            "--porcelain",
            f"-L{args.line},{args.line}",
            rev,
            "--",
            path,
        ]
        # if args.debug:
        #     print(cmd)
        proc = subprocess.Popen(
            cmd, shell=False, universal_newlines=True, stdout=subprocess.PIPE,
        )
        assert proc.stdout is not None
        info = proc.stdout.readline().split()
        rev = info[0]
        kws = {}
        match = None
        for line in proc.stdout:
            line = line.rstrip("\n")
            if line.startswith("\t"):
                # here's our match, there won't be anything else
                match = line
            else:
                parts = line.split(" ", 1)
                kws[parts[0]] = parts[1] if len(parts) > 1 else None
        status = proc.wait()
        if status != 0:
            print(f"'{' '.join(cmd)}' returned {status}")
        # found something useful
        print(f"{rev}: {match}")
        if "boundary" in kws:
            break
        prev = kws["previous"]
        assert prev is not None
        parts = prev.split(" ", 1)
        assert len(parts) == 2
        rev = parts[0]
        path = parts[1]


def main() -> int:
    parser = argparse.ArgumentParser("foo")
    parser.add_argument("--line", "-l", type=int, default=1)
    parser.add_argument("files", nargs="+")
    args = parser.parse_args()
    for path in args.files:
        blame(path, args)
    return 0


if __name__ == "__main__":
    try:
        sys.exit(main())
    except KeyboardInterrupt:
        sys.exit("\nInterrupted")
[Edit: this program badly needs a few checks for when Git doesn't run or git blame does not find the file or line. In particular, in those cases proc.stdout.readline() gets end-of-file and returns an empty string, so info[0] raises an IndexError. Use with caution, fix it up, or don't use it at all.]
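One possible shape for such a check, as a rough sketch rather than a finished fix: run the command, and treat a failure to start, a nonzero exit status, or empty output all as "nothing found". The helper name first_line_or_none is hypothetical, and the demonstration below uses the Python interpreter itself as a stand-in for git so that it runs anywhere.

```python
import subprocess
import sys
from typing import List, Optional


def first_line_or_none(cmd: List[str]) -> Optional[str]:
    """Return the first line of cmd's output, or None if there is none."""
    try:
        proc = subprocess.run(
            cmd, stdout=subprocess.PIPE, universal_newlines=True, check=False,
        )
    except OSError:
        # The program could not be started at all (e.g. not installed).
        return None
    if proc.returncode != 0:
        # The command ran but reported failure; its own error message has
        # already gone to stderr, so there is nothing more to parse.
        return None
    first = proc.stdout.split("\n", 1)[0]
    return first if first else None


if __name__ == "__main__":
    # Stand-in commands; in the program above, cmd would be the git blame list.
    print(first_line_or_none([sys.executable, "-c", "print('abc 1 1 1')"]))
    print(first_line_or_none([sys.executable, "-c", "raise SystemExit(3)"]))
```

A caller would then test for None before splitting the header line, and skip the file (or exit with a message) instead of indexing into an empty list.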
Using git log directly
This may not have the output format you want, but note that git log can do just what you want without having to write a bunch of new code:
git log --oneline -L1,1:table.csv
(or leave out the --oneline if you like). The -L directive takes two line numbers and a file name, or various other option formats, and does the same "find commits that modify the file" search that you were using git log table.csv for in the first place, but restricts the output still further, to show only those commits where the specified lines change.
Add --no-patch and an appropriate set of format directives, and you can get the commit hash IDs and whatever else you like, and then use some program to extract the lines from the specific files (e.g., git cat-file -p rev:path | sed -n -e "$line{p;q;}").
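If you would rather do the extraction step in Python than in sed, a minimal sketch might look like the following. The function names nth_line and line_at are made up for this example; line_at assumes it is run inside the repository and that rev:path names a text file.

```python
import subprocess
from typing import Optional


def nth_line(text: str, n: int) -> Optional[str]:
    """Return line n (1-based) of text, or None if there is no such line."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        # Drop the empty piece that follows a trailing newline.
        lines.pop()
    if 1 <= n <= len(lines):
        return lines[n - 1]
    return None


def line_at(rev: str, path: str, n: int) -> Optional[str]:
    """Fetch path as of rev with `git cat-file -p rev:path`, return line n."""
    proc = subprocess.run(
        ["git", "cat-file", "-p", f"{rev}:{path}"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raises CalledProcessError if Git reports an error
    )
    return nth_line(proc.stdout, n)


if __name__ == "__main__":
    # No Git needed for this part: pick line 1 of some CSV text.
    print(nth_line("id,name,price\n1,widget,9.99\n", 1))
```

Calling line_at("HEAD", "table.csv", 1) would then play the role of the cat-file and sed pipeline for a single commit.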
Note that git log is what Git calls a porcelain command (vs git rev-list or git blame --porcelain acting as what Git calls a plumbing command). Porcelain commands generally obey Git configurations, such as the settings for color.ui, core.pager, and pager.log, and settings like log.decorate. This makes them hard to use from other programs, as it's hard to know whether something will be colorized (with ESC [ 31 m sequences for instance). Plumbing programs behave in a well-defined manner so that other programs can know exactly what input to expect. This is why we normally want to use git rev-list rather than git log when writing scripts, if we're doing something that both commands can do.
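As an illustration of the scripting side, here is a small sketch that does the "list commits that touch this file" step with git rev-list from Python. The function names are invented for this example, and commits_touching assumes it is run inside a repository.

```python
import subprocess
from typing import List


def rev_list_cmd(path: str, rev: str = "HEAD") -> List[str]:
    """Build the plumbing command that lists commits touching path."""
    # The "--" separates the revision from the pathname, so a file whose
    # name looks like a branch or option is not misread.
    return ["git", "rev-list", rev, "--", path]


def commits_touching(path: str, rev: str = "HEAD") -> List[str]:
    """Run git rev-list and return the hash IDs, newest first."""
    proc = subprocess.run(
        rev_list_cmd(path, rev),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raises CalledProcessError if Git reports an error
    )
    # One hash ID per line, with no color codes or decorations to strip.
    return proc.stdout.split()


if __name__ == "__main__":
    print(" ".join(rev_list_cmd("table.csv")))
```

Because the output is just hash IDs, the caller does not need to guess at a user's color or decoration settings before parsing it.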