All those answers work well with ASCII characters, like insysion's one:
git log --pretty=format:'%w(79, 0, 29)%h %<(20,trunc)%an %s' -10
But that breaks down for UTF-8 characters.
With Git 2.39 (Q4 2022), "git diff --stat" etc. were invented back when everything was ASCII and strlen() was a way to measure the display width of a string; adjust them to compute the display width assuming UTF-8 pathnames.
See commit ce8529b (21 Oct 2022) by Junio C Hamano (gitster).
See commit 12fc4ad (14 Sep 2022) by Torsten Bögershausen (tboegi).
(Merged by Junio C Hamano -- gitster -- in commit 7d5a4d8, 28 Oct 2022)
diff.c: use utf8_strwidth() to count display width
Reported-by: Alexander Meshcheryakov
Helped-by: Johannes Schindelin
Signed-off-by: Torsten Bögershausen
When Unicode filenames (encoded in UTF-8) are used, the visible width on the screen is not the same as strlen().
For example, git log --stat may produce an output like this:
[snip the header]
Arger.txt  | 1 +
Ärger.txt | 1 +
2 files changed, 2 insertions(+)
A side note: the original report was about Cyrillic filenames.
After some investigation it turned out that
- a) This is not a problem with "ambiguous characters" in Unicode
- b) The same problem exists for all Unicode code points (so we can use Latin based Umlauts for demonstrations below)
The 'Ä' takes the same space on the screen as the 'A', but needs one more byte in memory, so the git log --stat output for "Arger.txt" (!) gets misaligned: the maximum length is derived from "Ärger.txt", 10 bytes in memory, 9 positions on the screen.
That is why "Arger.txt" gets one extra ' ' for alignment: it needs only 9 bytes in memory.
If there was a file "Ö", it would be correctly aligned by chance, but "Öhö" would not.
The solution is, of course, to use utf8_strwidth() instead of strlen() when dealing with the width on screen.
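To make the byte-versus-column difference concrete, here is a minimal C sketch. `approx_display_width()` and `extra_bytes()` are hypothetical helpers, not Git's `utf8_strwidth()`: the sketch assumes valid UTF-8 input and counts one column per code point, so it ignores double-width (for example CJK) and zero-width combining characters that a real width function has to handle.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for utf8_strwidth(): assumes valid UTF-8 and
 * counts one column per code point by skipping continuation bytes
 * (those of the form 10xxxxxx).
 */
static size_t approx_display_width(const char *s)
{
    size_t width = 0;

    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            width++;
    return width;
}

/* How many bytes strlen() counts beyond the approximate column count. */
static size_t extra_bytes(const char *s)
{
    return strlen(s) - approx_display_width(s);
}
```

For "Ärger.txt" encoded in UTF-8, strlen() reports 10 while `approx_display_width()` reports 9, so `extra_bytes()` is 1; for "Arger.txt" both counts are 9.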
And then there is another problem. Code like this:
strbuf_addf(&out, "%-*s", len, name);
(or using the underlying snprintf() function) does not align the buffer to a minimum of len measured in screen width, but uses the memory count.
One could be tempted to wish that snprintf() was UTF-8 aware.
That doesn't seem to be the case anywhere (tested on Linux and Mac); probably snprintf() uses the "bytes in memory"/strlen() approach to be compatible with older versions, and this will never change.
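The byte-based padding can be observed with a small C sketch. `printf_padding()` is a hypothetical helper written for this illustration: it formats a name with "%-*s" and counts the trailing spaces that snprintf() appended. It assumes the name itself does not end in a space and fits in the 256-byte scratch buffer.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format `name` left-aligned in a field of `len`
 * with "%-*s" and return the number of padding spaces snprintf() added.
 * Returns -1 on a formatting error or if the result was truncated.
 */
static int printf_padding(const char *name, int len)
{
    char buf[256];
    int n = snprintf(buf, sizeof(buf), "%-*s", len, name);
    int spaces = 0;

    if (n < 0 || (size_t)n >= sizeof(buf))
        return -1;
    while (n > 0 && buf[n - 1] == ' ') {
        spaces++;
        n--;
    }
    return spaces;
}
```

With a field width of 10, "Arger.txt" (9 bytes) receives one padding space, while UTF-8 encoded "Ärger.txt" (10 bytes) receives none, even though both names occupy 9 columns on screen.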
The basic idea is to change code in diff.c like this:
strbuf_addf(&out, "%-*s", len, name);
into something like this:
int padding = len - utf8_strwidth(name);
if (padding < 0)
    padding = 0;
strbuf_addf(&out, " %s%*s", name, padding, "");
The real change is slightly bigger, as it also merges two calls of strbuf_addf() into one.
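The padding idea can be tried outside of Git with plain snprintf(). This is only a sketch under stated assumptions: `format_name_column()` and `approx_display_width()` are hypothetical names, the width helper counts one column per UTF-8 code point (unlike Git's utf8_strwidth(), which also handles wide and zero-width characters), and a caller-supplied buffer replaces Git's strbuf.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical width helper: one column per UTF-8 code point. */
static int approx_display_width(const char *s)
{
    int width = 0;

    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            width++;
    return width;
}

/*
 * Hypothetical helper: write " name" followed by enough spaces to fill
 * `len` columns into buf. Returns the approximate display width of the
 * result, or -1 on error or truncation.
 */
static int format_name_column(char *buf, size_t size, const char *name, int len)
{
    int padding = len - approx_display_width(name);
    int n;

    if (padding < 0)
        padding = 0;
    /* "%*s" with an empty string emits exactly `padding` spaces. */
    n = snprintf(buf, size, " %s%*s", name, padding, "");
    if (n < 0 || (size_t)n >= size)
        return -1;
    return approx_display_width(buf);
}
```

Called with len set to 8, both "binfilë" and "tëxtfilë" come out 9 columns wide (one leading space plus an 8-column name field), although the two results differ in byte length.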
Tests: Two things need to be tested:
- The calculation of the maximum width
- The calculation of padding
The name "textfile" is changed into "tëxtfilë"; both have a width of 8.
If strlen() was used to get the maximum width, the shorter "binfile" would have been mis-aligned:
binfile    | [snip]
tëxtfilë | [snip]
If only "binfile" were renamed into "binfilë":
binfilë | [snip]
textfile | [snip]
In order to verify that the width is calculated correctly everywhere,
- "binfile" is renamed into "binfilë", giving 1 byte more in strlen()
- "textfile" is renamed into "tëxtfilë", giving 2 bytes more in strlen().
The updated t4012-diff-binary.sh checks the correct alignment:
binfilë  | [snip]
tëxtfilë | [snip]
The output from "git diff --stat" on an unmerged path lost the terminating LF in Git 2.39, which has been corrected with Git 2.40 (Q1 2023).
See commit 209d9cb (14 Dec 2022) by Peter Grayson (jpgrayson).
(Merged by Junio C Hamano -- gitster -- in commit e57caee, 26 Dec 2022)
diff: fix regression with --stat and unmerged file
Signed-off-by: Peter Grayson
Acked-by: Jeff King
A regression was introduced in 12fc4ad (diff.c: use utf8_strwidth() to count display width, 2022-09-14, Git v2.39.0-rc0 -- merge listed in batch #8).
That causes missing newlines after "Unmerged" entries in git diff --cached --stat output.
This problem affects v2.39.0-rc0 through v2.39.0.
Add the missing newline along with a new test to cover this behavior.