23

I have a file with Unicode characters (Russian text). When I fix a typo, I use git diff --color-words=. to see the changes I've made.

With Unicode (Cyrillic) characters I get a mess of angle brackets, like so:

$ cat p1
привет

$ cat p2
Привет

$ git diff --color-words=. --no-index p1 p2
diff --git 1/p1 2/p2
index d0f56e1..d84c480 100644
--- 1/p1
+++ 2/p2
@@ -1 +1 @@
<D0><BF><9F>ривет

It looks like git diff --color-words=. is comparing individual bytes, not characters as I expected.
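
A quick way to check this hypothesis is to dump the UTF-8 bytes of the two letters that differ. This is only a sketch: it assumes the snippet is saved and run as UTF-8 and that a POSIX `od` is available, and the variable names are made up for illustration.

```shell
# Hex-dump the UTF-8 encoding of each letter (od prints the bytes, tr strips the padding).
old_bytes=$(printf 'п' | od -An -tx1 | tr -d ' \n')
new_bytes=$(printf 'П' | od -An -tx1 | tr -d ' \n')

echo "п = $old_bytes"   # prints: п = d0bf
echo "П = $new_bytes"   # prints: П = d09f
```

Both letters share the lead byte D0 and differ only in the second byte (BF versus 9F). That matches the `<D0><BF><9F>` pattern in the output: a byte-level diff keeps D0, removes BF and adds 9F, and none of those pieces is valid UTF-8 on its own.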

Is there any way to tell git to handle Unicode characters properly?

UPD about my environment: I get the same result on Mac OS and on a Linux host.

My shell vars are:

BASH=/bin/bash
HOSTTYPE=x86_64
LANG=ru_RU.UTF-8
OSTYPE=darwin10.0
PS1='\h:\W \u\$ '
SHELL=/bin/bash
SHELLOPTS=braceexpand:emacs:hashall:histexpand:history:interactive-comments:monitor
TERM=xterm-256color
TERM_PROGRAM=iTerm.app
_=-l

I have reset git config to default settings like so:

$ git config -l
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true
core.ignorecase=true

Git version:

$ git --version
git version 1.7.3.5
chestozo
  • 3
    That's not `git diff` showing you "angle brackets" but rather `less` - the default pager `git diff` calls. Try exporting `LESSOPTS=-R` or `LESSOPTS=-RX` and see if this helps. There's also a problem with your question: you tell us precisely zero information about your environment. – kostix Jun 26 '13 at 13:34
  • I have updated the question with my env details. Also I tried to config git pager like so: `$ git config --global core.pager "less -R"` and this does not help. – chestozo Jun 26 '13 at 20:09
  • I have tried this also: `$ GIT_PAGER='' git diff --no-index --color-words=. p1 p2` `���ривет` `$ GIT_PAGER='' git diff --no-index --color-words=. --no-color p1 p2` `п�ривет` – chestozo Jun 26 '13 at 20:14
  • 1
    `--word-diff-regex=.` works on byte level and breaks multi-byte character. Sadly this option does not support codepoint range either (tried posix and pcre notation but none worked). – Jokester May 13 '16 at 09:58

6 Answers

38

For me less — the git pager — was to blame (thanks @kostix). Experiment by disabling the pager altogether:

git --no-pager diff p1 p2

My case was commit messages containing emojis; it's fundamentally the same problem though.

$ git log --oneline
93a1866 <U+1F43C>

$ git --no-pager log --oneline
93a1866 🐼

$ export LESS='--raw-control-chars'
$ git log --oneline
93a1866 🐼

$ git config --global core.pager 'less --raw-control-chars'
$ git log --oneline
93a1866 🐼

NB: the --RAW-CONTROL-CHARS option (-R) causes less to pass through only ANSI color escapes, and it will still munge other control chars (emoji included). My less is globally configured with --RAW-CONTROL-CHARS, and my git pager with --raw-control-chars (-r) as above.

  • 1
    Important part was `--color-words=.` because I do want to see diff by symbol. And this is what I get running `git --no-pager show --color-words=.`: `���ривет`. Same for `git --no-pager diff --color-words=. --no-index p1 p2`. – chestozo Apr 19 '16 at 14:48
  • 3
    `git config --global core.pager 'less --raw-control-chars'` was just what I needed to fix git-log display issues! – simey.me Jan 16 '17 at 10:19
  • Doesn't fix for me; anybody has other suggestions? I already had `pager = less -FrSX` under `[core]` in my *~/.gitconfig* but it doesn't help. My environment *LANG* is `en_US.UTF-8`; env *LESS* is `-M -I -R` (trying to change `-R` to `-r` doesn't change behaviour). – Kamafeather Feb 23 '19 at 23:27
21

For me, the best solution is to set export LESSCHARSET=utf-8.

With this set, both git log -p and git diff show Unicode without problems.

Magomed Abdurakhmanov
  • doesn't work for me. I have `pager = less -rFX` in `.gitconfig` and this is what I get https://d17oy1vhnax1f7.cloudfront.net/items/2p3703271r0m060s1J34/s.png?v=2d60f213 – chestozo Dec 25 '16 at 20:45
  • what do you see for `git diff --color-words=.` ? – chestozo Dec 25 '16 at 20:46
  • Do you have LANG set? Mine is `LANG=en_US.UTF-8` – Magomed Abdurakhmanov Dec 25 '16 at 22:10
  • Here is what I get with `git diff` and `git diff --color-words=.` https://www.dropbox.com/s/2wt9iysevw2xeyn/Screenshot%202016-12-25%2023.09.02.png?dl=0 – Magomed Abdurakhmanov Dec 25 '16 at 22:11
  • Here is when you have changed Cyrillic word https://www.dropbox.com/s/1x67c5jdngrhgp2/Screenshot%202016-12-25%2023.15.24.png?dl=0 – Magomed Abdurakhmanov Dec 25 '16 at 22:16
  • same here `$ echo $LANG >>> en_US.UTF-8`. I see this answer also here - http://stackoverflow.com/a/19436421/449345 - but somehow it doesn't help :/ – chestozo Dec 25 '16 at 22:50
  • 1
    This solved it for me. Created a new environment variable `LESSCHARSET` set to `utf-8` and `git log`/`diff` now displays Norwegian letters ÆØÅ properly instead of `<85>` etc. OS: Windows 10 – ardal Aug 13 '17 at 10:05
3

The solution for me was to use git difftool.

I wrote this tool, https://github.com/chestozo/dmp, based on https://code.google.com/p/google-diff-match-patch/.

Sometimes it also gives a better diff than git diff --color-words=. :)

chestozo
3

On several platforms, setting LANG to C.UTF-8 (or en_US.UTF-8, etc.) works:

$ echo '人' >test1.txt && echo '丁' >test2.txt
$ LANG=C.UTF-8 git diff --no-index --word-diff=plain --word-diff-regex=. -- test1.txt test2.txt
diff --git a/test1.txt b/test2.txt
index 3ef0891..3773917 100644
--- a/test1.txt
+++ b/test2.txt
@@ -1 +1 @@
[-人-]{+丁+}

However, LANG doesn't seem to be honored on some platforms (such as Git for Windows):

$ echo '人' >test1.txt && echo '丁' >test2.txt
$ LANG=C.UTF-8 git diff --no-index --word-diff=plain --word-diff-regex=. -- test1.txt test2.txt
diff --git a/test1.txt b/test2.txt
index 3ef0891..3773917 100644
--- a/test1.txt
+++ b/test2.txt
@@ -1 +1 @@
<E4>[-<BA><BA>-]{+<B8><81>+}

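A workaround on these platforms is to spell out the raw bytes of a UTF-8 character in the regex passed to git diff (e.g. $'[^\x80-\xBF][\x80-\xBF]*' in place of '.', that is, one byte outside 0x80-0xBF followed by any number of continuation bytes in that range):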

$ echo '人' >test1.txt && echo '丁' >test2.txt
$ git diff --no-index --word-diff=plain --word-diff-regex=$'[^\x80-\xBF][\x80-\xBF]*' -- test1.txt test2.txt
diff --git a/test1.txt b/test2.txt
index 3ef0891..3773917 100644
--- a/test1.txt
+++ b/test2.txt
@@ -1 +1 @@
[-人-]{+丁+}
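
To see why that pattern lines up with character boundaries: in UTF-8, every byte of a character after the first is a continuation byte in the range 0x80-0xBF, so "one other byte plus any continuation bytes that follow" is exactly one character. The sketch below applies the same grouping rule by hand. `split_utf8` is a hypothetical helper name, and the snippet assumes bash, a UTF-8 script file, and a POSIX `od`.

```shell
# Group the bytes of a UTF-8 string the way [^\x80-\xBF][\x80-\xBF]* would:
# a byte in 0x80-0xBF continues the current character, any other byte starts a new one.
split_utf8() {
  local hex out=""
  for hex in $(printf '%s' "$1" | od -An -v -tx1); do
    case $hex in
      8?|9?|a?|b?) out+="$hex" ;;            # continuation byte: append to current group
      *)           out+="${out:+ }$hex" ;;   # lead or ASCII byte: start a new group
    esac
  done
  printf '%s\n' "$out"
}

split_utf8 'привет'   # prints: d0bf d180 d0b8 d0b2 d0b5 d182
```

Each space-separated group is the byte sequence of one Cyrillic letter, which is the unit the regex hands to git diff as a "word".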
Danny Lin
  • The LANG var in my case is `en_US.UTF-8` and it doesn't help much. `git diff --color-words=. --word-diff-regex=$'[^\x80-\xBF][\x80-\xBF]*'` is a good one! thank you ) will check it out! – chestozo May 13 '18 at 08:16
  • 1
    If you are using --word-diff-regex, it'd be better to use --word-diff=color instead of --color-words (which is a combination of both). Furthermore, you can set `diff.wordRegex` so that you can provide only --word-diff=color in the future, and git will use the configured regex for word diff. – Danny Lin May 13 '18 at 08:22
  • Be careful with rendering UTF8 chars in git diffs. It's very possible to sneak some malicious code by hiding it with dubious UTF8 characters (zero width characters, I'm looking at you) – Yarek T Aug 13 '18 at 16:12
1

toolbear's answer didn't work for me: even with git --no-pager diff I still saw unreadable characters (not angle brackets, but unreadable), so less was not the core problem.

I tried a ton of things, but the only thing that helped was adding an explicit conversion from Cyrillic (cp1251) to utf-8 to .git\config (I'm using Windows 7):

[pager]
diff = iconv.exe -f cp1251 -t utf-8 | less  

Note that I changed pager.diff specifically, since I had encoding problems only with the diff command. For some weird reason log and reflog were working fine for me. But if you have encoding problems with other commands too, you should change the pager for all commands, like this:

[core]
...
pager = iconv.exe -f cp1251 -t utf-8 | less 
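
The conversion step of that pager pipeline can be tried in isolation. A minimal sketch, assuming an `iconv` that knows the `CP1251` charset name and a UTF-8 terminal; the input bytes are the cp1251 encoding of привет, written as octal escapes so the snippet itself stays valid UTF-8, and `converted` is just an illustrative variable name.

```shell
# cp1251 bytes EF F0 E8 E2 E5 F2 spell "привет"; iconv re-encodes them as UTF-8.
converted=$(printf '\357\360\350\342\345\362' | iconv -f CP1251 -t UTF-8)
echo "$converted"   # prints: привет
```

If the diffed files are stored in a different legacy encoding, the `-f` argument would need to name that encoding instead.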
klm123
0

I have seen a lot of reports that xterm is not really able to print Unicode characters in some cases. Maybe that is at least a starting point for a solution.

frlan
  • 1
    In this case the problem is the 2 bytes used to represent a Unicode symbol in the shell, while `git diff` only knows how to deal with 1-byte symbols. I am not sure it is an xterm problem. – chestozo Nov 23 '13 at 15:56
  • Well... Doesn't look like ... at least this worked for me diff --git a/README b/README index e69de29..b562a56 100644 --- a/README +++ b/README @@ -0,0 +1 @@ +µÜäčřúůжжвыаьь – frlan Nov 23 '13 at 18:25