
After yet another git pull, my project stopped building with a bunch of messages like:

error: unmappable character for encoding UTF-8

The messages point to the copyright symbol found in some of the file headers. There are many more files with the same symbol, but they seem to compile fine. When viewed in a binary editor, the good one appears as:

C2 A9

while the bad one appears as:

A9

When viewed in vim, both are shown as © (<©> 169, Hex 00a9, Octal 251), but IntelliJ IDEA shows the bad ones as a diamond.

So I decided that I had messed something up when merging (there were merge conflicts after the pull) and went to look at which files were changed with

git diff-tree --no-commit-id --name-only -r --full-index --binary 91cbe7b753d39905372c1ea41e04e7a3dbd2566e

but it produces no output. No changes are found for the previous commit either. The log looks like this:

commit 91cbe7b753d39905372c1ea41e04e7a3dbd2566e
Merge: d7b4ae9 0dfc198
Author: Me Me <my.my@gmail.com>
Date:   Wed Dec 23 17:50:46 2015 +0100

    Merge branch 'development' of ssh://fsstash.cool.com:7999/our/server into my-branch

commit 0dfc19850b2e31d72c1d2923321430e8fc1b53cb
Merge: 724b8a7 d3478f9
Author: Good Guy <Good.Guy@gmail.com>
Date:   Wed Dec 23 14:34:33 2015 +0200

    Merge branch 'development' of ssh://fsstash.cool.com:7999/our/server into development

When I do git checkout 0dfc19850b2e31d72c1d2923321430e8fc1b53cb, everything compiles fine.

So the question is: how can I fix it?

By fix I mean understanding what happened and reapplying the pull changes (maybe), so that I wouldn't have to commit anything related to this fix into the upstream repo.

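It seems like the bad one is in a single-byte encoding such as ISO-8859-1 (0xA9), not UTF-16 as vim's "Hex 00a9" might suggest (UTF-16 would need two bytes, 0x00 0xA9), while the good one is UTF-8 (0xC2 0xA9). What might have changed it?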
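The two byte sequences from the binary editor can be checked directly. Below is a minimal sketch, assuming nothing beyond the JDK's own charset API; the class name CopyrightBytes and the helper strictDecode are made up for illustration. It decodes the bytes strictly, so invalid input is reported instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the project in question.
class CopyrightBytes {

    // Decodes strictly: returns null when the bytes are not valid in the
    // given charset, instead of substituting a replacement character.
    static String strictDecode(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] good = {(byte) 0xC2, (byte) 0xA9}; // bytes of the "good" file
        byte[] bad = {(byte) 0xA9};               // byte of the "bad" file

        System.out.println("C2 A9 as UTF-8:      " + strictDecode(good, StandardCharsets.UTF_8));
        System.out.println("A9 as UTF-8:         " + strictDecode(bad, StandardCharsets.UTF_8));
        System.out.println("A9 as ISO-8859-1:    " + strictDecode(bad, StandardCharsets.ISO_8859_1));
        System.out.println("00 A9 as UTF-16BE:   "
                + strictDecode(new byte[] {0x00, (byte) 0xA9}, StandardCharsets.UTF_16BE));
    }
}
```

Strict UTF-8 decoding accepts C2 A9 and rejects a lone A9, while ISO-8859-1 decodes the lone A9 to the copyright sign. That is consistent with the bad files having been saved in a Latin-1 style encoding rather than UTF-16.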

The build system is Maven, but that's not related, as the same error is reported by bare javac on a copied and minified file. The OS is Ubuntu 15.10; locale says this:

locale
LANG=ru_RU.UTF-8
LANGUAGE=ru:en
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC=ru_UA.UTF-8
LC_TIME=ru_UA.UTF-8
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY=ru_UA.UTF-8
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER=ru_UA.UTF-8
LC_NAME=ru_UA.UTF-8
LC_ADDRESS=ru_UA.UTF-8
LC_TELEPHONE=ru_UA.UTF-8
LC_MEASUREMENT=ru_UA.UTF-8
LC_IDENTIFICATION=ru_UA.UTF-8
LC_ALL=

java -version: 1.8.0_66.

Any help is highly appreciated!

PS: tried all --diff-algorithm={patience|minimal|histogram|myers} - still no changes found by git-diff-tree

PS: git reset --hard HEAD~1 followed by git pull origin development, issued from the command line, didn't help, so it's not related to IDEA.

user656449
  • What tool did you use to resolve the conflicts? vim? – CB Bailey Dec 24 '15 at 21:53
  • try to replace the copyright symbol as the escape code `&copy`. – seoyoochan Dec 25 '15 at 03:27
  • but that would mean I'll have to commit the changes, which I'd really like to avoid: each commit gets reviewed and I would have to explain somehow why I changed those copyrights. And there is definitely something wrong with my git and/or IntelliJ IDEA, and I want to know what. As I said, there are many more files with the copyright symbol, and what if other portions get screwed up next time? – user656449 Dec 25 '15 at 06:46
  • @user656449 figuring out the root cause and preventing it is great for the future, but you have a file with the wrong contents, you're not going to fix that *without* a commit, so get over that part :) – hobbs Dec 25 '15 at 06:58

2 Answers

git diff-tree turned out to be the wrong diff to use in this case. Running

git diff --name-only a35f25470bc8219e3f2a45316963dde660091bcb 0dfc19850b2e31d72c1d2923321430e8fc1b53cb

revealed a lot of changes between the branches, one of them an update of the maven-compiler-plugin configuration that changed the Java version from 7 to 8. And it looks like javac 8 treats encoding problems as errors whereas javac 7 treats them as warnings (although it writes the absolutely identical "error: unmappable character for ..." message to the log).
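Since the error only shows up once javac insists on valid UTF-8, it can help to list every source file that is not valid UTF-8 before deciding what to do about them. Here is a rough sketch, assuming the sources are meant to be UTF-8 and end in .java; the class NonUtf8Finder and its methods are hypothetical names, not part of any tool mentioned here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical scanner: lists files whose bytes are not valid UTF-8.
class NonUtf8Finder {

    // True when the bytes decode as UTF-8 without malformed sequences.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Walks root recursively and returns files ending in suffix
    // that fail the strict UTF-8 check, in sorted order.
    static List<Path> findNonUtf8(Path root, String suffix) throws IOException {
        List<Path> candidates;
        try (Stream<Path> paths = Files.walk(root)) {
            candidates = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
        List<Path> offenders = new ArrayList<>();
        for (Path p : candidates) {
            if (!isValidUtf8(Files.readAllBytes(p))) {
                offenders.add(p);
            }
        }
        return offenders;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the first argument is the source root; defaults to ".".
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findNonUtf8(root, ".java")) {
            System.out.println(p);
        }
    }
}
```

Running it against the source root prints one path per offending file. Comparing that list with git log for those paths may show which commit introduced the single-byte copyright sign.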

user656449

git diff --name-only is indeed more suited for parsing, as shown with Git 2.32 (Q2 2021), which clarifies that pathnames recorded in Git trees are most often (but not necessarily) encoded in UTF-8.

See commit 9364bf4 (20 Apr 2021) by Andrey Bienkowski (hexagonrecursion).
(Merged by Junio C Hamano -- gitster -- in commit 93e0b28, 30 Apr 2021)

doc: clarify the filename encoding in git diff

AFAICT parsing the output of git diff --name-only master...feature is the intended way of programmatically getting the list of files modified by a feature branch.

It is impossible to parse text unless you know what encoding it is in.

diff-options now includes in its man page:

Show only names of changed files. The file names are often encoded in UTF-8.

diff-options now includes in its man page:

Just like --name-only the file names are often encoded in UTF-8.
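For programmatic use, one possible way to consume that list is sketched below. It assumes the command is run with the -z option, so that each path is terminated by a NUL byte and not quoted, and it assumes the paths are UTF-8, which, as the documentation says, is common but not guaranteed. NameOnlyParser is a made-up name for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the NUL-terminated output of
// "git diff --name-only -z".
class NameOnlyParser {

    // Splits on NUL bytes and decodes each path as UTF-8.
    // Bytes that are not valid UTF-8 become U+FFFD in the result.
    static List<String> parse(byte[] output) {
        List<String> names = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < output.length; i++) {
            if (output[i] == 0) {
                names.add(new String(output, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Reads the raw bytes of the git output from standard input.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = System.in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        for (String name : parse(buffer.toByteArray())) {
            System.out.println(name);
        }
    }
}
```

It could be fed with something like git diff --name-only -z master...feature | java NameOnlyParser. Working on raw bytes first and decoding second keeps the split correct even if a path turns out not to be UTF-8.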

VonC