
I have a pre-commit hook that runs

files=`git diff --cached --name-only --diff-filter=ACMR | grep -E "$extension_regex"`

and performs some formatting on those files before committing.

However, I have some files whose names contain non-ASCII letters, and I realized those files weren't being formatted.

After some debugging, I found that this was because git diff outputs those file names with escaped characters and surrounded by double quotes, for example:

"\341\203\236\341\203\220\341\203\240\341\203\220\341\203\233\341\203\224\341\203\242\341\203\240\341\203\224\341\203\221\341\203\230.ext"

I tried to modify the regex pattern to accept names surrounded with quotes, and even tried removing those quotes, but anywhere I try to access the file it can't be found, for example:

$ cat $file
cat: '"\341\203\236\341\203\220\341\203\240\341\203\220\341\203\233\341\203\224\341\203\242\341\203\240\341\203\224\341\203\221\341\203\230.ext"': No such file or directory

$ file="${file:1:${#file}-2}"

$ cat $file
cat: '\341\203\236\341\203\220\341\203\240\341\203\220\341\203\233\341\203\224\341\203\242\341\203\240\341\203\224\341\203\221\341\203\230.ext': No such file or directory

How do I handle files with non-ASCII characters in their names?

Guiorgy
  • If you're planning to parse or otherwise automatically use the output of a git command, it's better to use [a plumbing command](https://stackoverflow.com/questions/6976473/what-does-the-term-porcelain-mean-in-git) like `git diff-index` instead of a porcelain one like `git diff`. The latter is optimized for human consumption, whereas the former has a fixed output format that's optimized for software consumption. – Joachim Sauer Jul 13 '23 at 13:07
  • good tip, I'll replace it, but the output is the same in this case so my problem remains – Guiorgy Jul 13 '23 at 13:12

1 Answer


You can use the -z option to get NUL-terminated output instead of the C string literal quoting; that takes care of non-ASCII characters in paths.

files=$(
    git diff -z --cached --name-only --diff-filter=ACMR \
    | grep -Ez "$extension_regex" \
    | tr \\0 \\n
)
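
If you'd rather not convert the NULs to newlines at all, one possible variation (a sketch only, assuming the hook runs under bash) is to read the NUL-delimited list straight into an array. The names `collect_paths` and `matched` are made up for illustration, and `printf` stands in for the git command so the snippet runs outside a repository.

```shell
#!/usr/bin/env bash
# Sketch: collect NUL-delimited paths from stdin into the array `matched`,
# keeping only those that match an extended regex.
# `collect_paths` and `matched` are hypothetical names, not part of git.
collect_paths() {
    local regex=$1 path
    matched=()
    # -d '' makes read stop at each NUL byte; IFS= and -r keep the name intact.
    while IFS= read -r -d '' path; do
        if [[ $path =~ $regex ]]; then
            matched+=("$path")
        fi
    done
    return 0
}

# In a hook the input would come from git, for example:
#   collect_paths "$extension_regex" < <(git diff -z --cached --name-only --diff-filter=ACMR)
# Here printf simulates that output.
collect_paths '\.json$' < <(printf '%s\0' 'tmp 1.json' 'notes.txt' 'პარამეტრები.json')

for file in "${matched[@]}"; do
    # Always quote "$file" so spaces and non-ASCII bytes survive.
    printf 'would format: %s\n' "$file"
done
```

The process substitution (`< <(...)`) matters here: piping into the function would run it in a subshell and the array would be lost when the pipeline ends.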

UTF-8 is still not completely universal and may never be; filesystems are so disparate that anything beyond ASCII is not entirely portable. Git plays it annoyingly safe by default, encoding any path that won't round-trip in ASCII using C string literal conventions. That choice does guarantee a safe round trip, though, which basically nothing else does (at least not yet), so there's that.

If you're not worried about completely unconstrained file names, in particular if you don't need to handle file names that contain newlines, you can move the tr up a step and drop the -z option from the grep. Or you can drop the -z options entirely and turn core.quotepath off instead. From the command line:

git -c core.quotepath=false diff --name-only | grep etc

or in the configs.
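
For the config route, `git config core.quotepath false` stores the setting in the repository's config. With quoting off, names arrive one per line, so a plain read loop is enough. The following is a rough sketch under the assumption that no file name contains a newline; `list_names` is a hypothetical stand-in for the git command, and the temporary files exist only so the loop has something to read.

```shell
#!/usr/bin/env bash
# Sketch: consume one path per line, as printed by
#   git -c core.quotepath=false diff --cached --name-only --diff-filter=ACMR
# Assumes no file name contains a newline.

workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/tmp 1.json"
printf 'world\n' > "$workdir/პარამეტრები.json"

# Hypothetical stand-in for the git command above.
list_names() { printf '%s\n' 'tmp 1.json' 'პარამეტრები.json'; }

contents=$(
    list_names | while IFS= read -r file; do
        # Quoting "$file" keeps the space and the non-ASCII bytes in one argument.
        cat "$workdir/$file"
    done
)
printf '%s\n' "$contents"

rm -rf "$workdir"
```

Note the quoted `"$file"`: an unquoted `cat $file`, as in the question, would split `tmp 1.json` into two arguments even after the quoting problem is solved.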

jthill
  • The only nitpick I have is that it would be perfect if you had some simple explanation on the `tr` command, and how `quotepath=false` is "less safe" in this case, or in general. As a Git Bash user on Windows, I am not too familiar with Unix commands, but from what I can tell from --help, it translates/replaces the trailing null termination with newline character to be inline with the usual `diff` output format. As for quotepath, what does `are not considered "unusual" any more` exactly mean, and how is it different from `do not munge pathnames and use NULs as output field terminators.`? – Guiorgy Jul 14 '23 at 10:55
  • Actually, the first option (using -z) outputs the list of files without the space separator with or without the `tr` command, e.g. `files: tmp 1.jsontmp 2.json`, which completely messes the script when there are multiple matches. The second option (quotepath) on the other hand does seem to work, e.g. `files: tmp 1.json tmp 2.json`. Hope you edit the answer – Guiorgy Jul 14 '23 at 11:25
  • `-z` outputs the list with `\0` line terminators, and `\0`, the nul character, does not display at all. Pipe it through `cat -A` to have the nuls represented in a way a terminal will display (the historical function of nuls, [on devices you can still find in use](https://youtu.be/rnR-h51wh6s), is as time fillers on the communication link waiting for the print head to return to the start of the line) – jthill Jul 14 '23 at 19:09
  • The way I talked about the difference between core.quotepath and -z got sloppy, sorry for the confusion there. Thanks for pointing me at that, there's a first fix in now. – jthill Jul 14 '23 at 19:18
  • I figured it out: I was using backticks (\`) to surround the expression, which resulted in output without separators; doing it inside $() worked as intended! Looks like I need to do some reading on Bash scripts :D. On a last note, there's an extra backtick in your first code block; I tried to edit it but need to change at least 6 characters to submit an edit. – Guiorgy Jul 15 '23 at 11:09
  • backticks or `$()`, they're both syntax for delimiting text for command expansion; delimit the same text you get the same result. You're seeing something else. – jthill Jul 15 '23 at 12:04
  • After further testing, `| tr \\0 \\n` resulted in `warning: command substitution: ignored null byte in input`; after doubling the backslashes (`| tr \\\\0 \\\\n`) it worked. See [Exactly how do backslashes work within backticks?](https://stackoverflow.com/q/57447644); it looks like the backslash was being consumed inside the backticks. The more you know... – Guiorgy Jul 15 '23 at 13:15
  • Thanks for sticking with that, I had never noticed that little weirdity. ``echo $(echo -\\n-)`` and ``echo `echo -\\n-` `` don't print the same thing. – jthill Jul 15 '23 at 16:38
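
The backslash difference from the last few comments can be made visible with a small sketch. `show_args` is a hypothetical helper that prints each argument it receives in brackets, standing in for `tr`, so you can see what the command actually gets under each form of command substitution.

```shell
#!/usr/bin/env bash
# Print each argument in brackets to show what a command actually receives.
# `show_args` is a hypothetical helper used only for illustration.
show_args() { printf '[%s]' "$@"; }

# Inside $(), the text is parsed once: \\0 becomes \0.
modern=$(show_args \\0 \\n)

# Inside backticks, \\ is first reduced to \, then \0 is parsed again and becomes 0.
legacy=`show_args \\0 \\n`

# Doubling the backslashes compensates for the extra pass inside backticks.
doubled=`show_args \\\\0 \\\\n`

printf '%-28s %s\n' 'dollar-paren:' "$modern"
printf '%-28s %s\n' 'backticks:' "$legacy"
printf '%-28s %s\n' 'backticks, doubled slashes:' "$doubled"
```

So inside backticks `tr \\0 \\n` ends up translating the character `0` to `n` and leaves the NUL bytes alone, which fits the "ignored null byte" warning and the missing separators reported above.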