
I'm having some trouble getting Unicode to work in git-bash (on Windows 7). I have tried many things without success, although I'm not quite sure which component is responsible for this, so I might be working in the wrong direction.

It really seems this should be possible, since the code page for cmd.exe can be changed to UTF-8 with 'chcp 65001'.

Here are some things I've tried (besides the obvious step of looking through the configuration options in the GUI).

  1. Setting environment variables in '.bashrc'. I guess it makes sense that this doesn't work, since I think it's a Linux thing. The 'locale' command does not exist.

    export LC_ALL=en_US.UTF-8
    export LANG=en_US.UTF-8
    export LANGUAGE=en_US.UTF-8
    
  2. Starting out in cmd.exe, changing the code page to UTF-8 with 'chcp 65001' and then starting git-bash. This gives me a "permission denied" error when I try to cat my Unicode test file, although catting a file without Unicode works just fine. As the transcript shows, after dropping back out to cmd.exe I can still "cat" the file (with type). With my default code page (437) I can cat the file in bash without the permission error, but the output is garbled.

    S:\>chcp 65001
    Active code page: 65001
    S:\>"C:\Program Files (x86)\Git\bin\sh.exe" --login -i
    zarac@TOWELIE /z
    cat /s/unicode.txt
    cat: write error: Permission denied
    zarac@TOWELIE /z
    cat /s/nounicode.txt
    abc
    zarac@TOWELIE /z
    L /s/unicode.txt
    -rw-r--r--    1 zarac    Administ        7 May 18 10:30 /s/unicode.txt
    zarac@TOWELIE /z
    whoami
    towelie\zarac
    zarac@TOWELIE /z
    exit
    Z:\>type S:\unicode.txt
    abc£
    
  3. Using the /U flag when starting the shell (it makes sense that this doesn't work because, if I understand correctly, that's not quite what the flag is for, but it has to do with Unicode so I tried it).

    C:\Windows\SysWOW64\cmd.exe /U /C "C:\Program Files (x86)\Git\bin\sh.exe" --login -i
    
  4. As I prefer to use Console2, I've tried adding a DWORD value named CodePage with the value 65001 (decimal) to the Windows registry under [HKEY_CURRENT_USER\Console] as well as [HKEY_CURRENT_USER\Console\Git Bash]. This seems to have the same effect as running 'chcp 65001', except that it's "automatic". (http://stackoverflow.com/questions/379240/is-there-a-windows-command-shell-that-will-display-unicode-characters)

  5. JPSoft's TCC/LE

  6. PowerCMD

  7. stackoverflow

  8. duckduckgo

  9. ixquick / google

So, method 2 seems viable if that permission issue can be fixed. However, I'm open to pretty much any solution, although I'd prefer to keep using Console2 (mostly due to its nifty tab feature). Perhaps one solution would be to set up an SSH server and then use Putty/Kitty to connect to it, but that's just wrong! ; )

PS. Is there any official documentation for git-bash?

Hannes
  • msysgit 1.7.10 handles unicode correctly. See [this page](https://github.com/msysgit/msysgit/wiki/Git-for-Windows-Unicode-Support) for official documentation – CharlesB May 18 '12 at 11:47
  • `I'm open to pretty much any solution`: Purge evil, Install linux, ???, Profit!!! :P Sorry – KurzedMetal May 18 '12 at 11:53
  • What about using Cygwin and rxvt? – KurzedMetal May 18 '12 at 11:54
  • Thanks for your answers and your edit CharlesB! – Hannes May 18 '12 at 12:04
  • In case it wasn't clear, thanks to you too KurzedMetal. ;) – Hannes May 18 '12 at 16:01
  • The problem with `chcp 65001` is that there are bugs in the C runtime (MSVCRT) that make stdio calls return inconsistent results when run under code page 65001. This is why 65001 is not available to pick as an ANSI code page from the Regional And Language Options dropdown. For apps compiled against other runtimes you can get away with it, but many native Windows apps will crash and burn. – bobince May 21 '12 at 18:06
  • @Hannes - nkatsar's answer below is the one that answered your question. You should consider changing your accepted answer. – jww Feb 19 '17 at 04:40

9 Answers


I faced the same issue in MSYS Git 2.8.0, and as it turned out, it just needed a configuration change.

$ git --version

git version 2.8.0.windows.1

The default configuration of Git Bash console in my system did not show Greek filenames.

$ cd ~

$ ls

AppData/
'Application Data'@
Contacts/
Cookies@
Desktop/
Documents/
Downloads/
Favorites/
Links/
'Local Settings'@
NTUSER.DAT
.
.
.
''$'\316\244\316\261'' '$'\316\255\316\263\316\263\317\201\316\261\317\206\316\254'' '$'\316\274\316\277\317\205'@

The last line should display "Τα έγγραφά μου", the Greek translation of "My Documents". To fix it, I followed the steps below:

  1. Check your existing locale configuration

    $locale
    
    LANG=en
    LC_CTYPE="C"
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_ALL=
    

    As shown above, in my case it was not UTF-8

  2. Change the locale to a UTF-8 encoding. Click the icon on the left side of the MINGW title bar, select "Options", and in the "Text" category choose "UTF-8" as the character set. You should also choose a Unicode font, such as the default "Lucida Console". My configuration looks as follows: [image: MinGW locale configuration]

  3. Change the language for the current window (there is no need to do this in future windows, as they will be created with the settings from step 2)

     $ LANG='C.UTF-8'
    
  4. The ls command should now display properly

    AppData/
    'Application Data'@
    Contacts/
    Cookies@
    Desktop/
    Documents/
    Downloads/
    Favorites/
    Links/
    'Local Settings'@
    NTUSER.DAT
    .
    .
    .
    'Τα έγγραφά μου'@
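
To check quickly whether step 3 is still needed in a given window, something like the following sketch may help. It only inspects the environment variables in the usual POSIX precedence order (LC_ALL, then LC_CTYPE, then LANG); the function name `is_utf8_locale` is made up for this example, and it does not look at the mintty "Character set" option at all.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the effective character locale names UTF-8.
# Precedence assumed here: LC_ALL overrides LC_CTYPE, which overrides LANG.
is_utf8_locale() {
    effective="${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
    case "$effective" in
        *UTF-8*|*utf-8*|*UTF8*|*utf8*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_utf8_locale; then
    echo "locale is UTF-8"
else
    echo "locale is not UTF-8; try: export LANG='C.UTF-8'"
fi
```

With the `LANG=en` setup shown in step 1, this would report that the locale is not UTF-8.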
    
nkatsar
  • Just adding on this. All these settings are stored in ~/.minttyrc – Vargas Jul 19 '16 at 15:14
  • You are the only one who actually answered the question. – jww Feb 19 '17 at 04:39
  • @jww Glad to know I could be of help – nkatsar Feb 20 '17 at 11:43
  • My locale is UTF-8 and `ls` show the file in correct form `初めに.txt` but `git status` still broken `σê¥πéüπü½.txt`, can you give it a look? https://gist.github.com/long-nguyenxuan/18c2e85bf29f1fb91cef0250bd5082ec – Luke Aug 07 '20 at 10:33

Found this answer elsewhere:

chcp.com 65001

Git bash chcp windows7 encoding issue

That's what actually solved it for me.

TravisChambers

As CharlesB said in a comment, msysgit 1.7.10 handles unicode correctly. There are still a few issues but I can confirm that updating did solve the issue I was having.

See: https://github.com/msysgit/msysgit/wiki/Git-for-Windows-Unicode-Support

Hannes
  • How does one enable it? I'm using Git Bash 2.11, and it's not using UTF-8 on Windows 8 or Windows 10. My outputs are often polluted with incorrect characters, and I'd like to clear the issue. – jww Feb 19 '17 at 04:31
  • @jww After upgrading from Git 2.9.2 to 2.15.0 just now I found it had essentially regressed, showing the UTF-8 encoding (e.g. `<80><99>` for `’`) in reverse highlighting (that is, black on white rather than white on black) rather than the actual character as it had formerly done. I fixed it by setting the environment variable `LC_ALL=en_AU.utf8` globally. Note that I use `chcp 65001` (the pseudo-UTF-8 codepage for conhost). This is fairly similar to what the top-voted answer on this question says to do. – Chris Morgan Nov 09 '17 at 05:47

Check if the issue persists with Git 2.1 (August 2014).
See commit 617ce96 or commit 1c950a5 by Karsten Blees (kblees)

Win32: support Unicode console output

WriteConsoleW seems to be the only way to reliably print unicode to the console (without weird code page conversions).

Also redirects vfprintf to the winansi.c version.

Win32: add Unicode conversion functions

Add Unicode conversion functions to convert between Windows native UTF-16LE encoding to UTF-8 and back.

To support repositories with legacy-encoded file names, the UTF-8 to UTF-16 conversion function tries to create valid, unique file names even for invalid UTF-8 byte sequences, so that these repositories can be checked out without error.

It is likely to be a port of something already integrated in msysgit, but at least that means the Windows version of Git won't have to diverge/patch from the main Git repo source code in order to include those improvements.

VonC
  • How does one enable UTF-8? I'm using Git Bash 2.11, and it's not using UTF-8 on Windows 8 or Windows 10. My outputs are often polluted with incorrect characters, and I'd like to clear the issue. – jww Feb 19 '17 at 04:33

I can see that there are some problems with character encoding in Git Bash for Windows, though less so when working with git itself and the tools it ships with (curl, cat, grep etc.). Over the years I haven't run into encoding-related problems with those.

Normally with each new version problems get better resolved. E.g. with the version from a year ago, I couldn't enter characters like "ä" into the shell, so it was not possible to write

echo "ä"

to quickly test whether UTF-8 is supported and at which level. A workaround is to write the byte sequence in octal:

$ echo -e "\0303\0244"
ä
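
If you want to double-check which bytes an octal escape sequence really produces, independent of what the terminal renders, a small sketch like this can help. The helper name `bytes_hex` is just something I made up here; it passes its argument to `printf` as a format string, so it is only meant for escape sequences like the ones above, not for arbitrary input.

```shell
#!/bin/sh
# Hypothetical helper: expand the octal escapes in $1 and print the
# resulting bytes as lowercase hex, without any separators.
bytes_hex() {
    printf "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

bytes_hex '\303\244'   # the two UTF-8 bytes of "ä" (U+00E4)
bytes_hex '\316\244'   # the two UTF-8 bytes of "Τ" (U+03A4)
```

Going through `od` sidesteps the console entirely, so it tells you whether the bytes are right even when the display is not.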

I still have issues, though, when I run my Windows php.exe binary to output text:

$ php -r 'echo "\xC3\xA4";'
ä

This does not give the "ä" in the terminal, but outputs "├ñ" instead. My workaround is to wrap the php command in a bash script that passes the output through cat:

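#!/bin/bash
# Route php.exe's stdout and stderr through separate cat processes
# (fd 3 carries stdout past the inner cat, which handles stderr).

{ php.exe "$@" 2>&1 1>&3 | cat 1>&2; } 3>&1 | cat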

ref. reg. stdout + stderr cat

This then magically makes php work again:

$ php -r 'echo "\xC3\xA4";'
ä
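
The descriptor shuffling in that wrapper is easier to follow in isolation. Below is a sketch of the same redirection pattern with a made-up function `emit` standing in for php.exe (it is not part of the original setup); it shows that stdout and stderr each pass through their own `cat` and still come out on the right stream.

```shell
#!/bin/sh
# Hypothetical stand-in for php.exe: one line on stdout, one on stderr.
emit() {
    printf 'out\n'
    printf 'err\n' >&2
}

# Same pattern as the wrapper: fd 3 is a copy of the outer pipe, so emit's
# stdout goes to the outer cat, while its stderr goes through the inner cat,
# which writes back to stderr.
filtered() {
    { emit 2>&1 1>&3 | cat 1>&2; } 3>&1 | cat
}

filtered 2>/dev/null        # only the stdout line is shown
filtered 2>&1 >/dev/null    # only the stderr line is shown
```

This only demonstrates the plumbing; it says nothing about why the extra `cat` fixes the encoding on Windows.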

Applies to

$ git --version
git version 1.9.4.msysgit.1

I must admit I lack a deeper understanding of why all this is the way it is. But I'm happy that I finally found a workaround to use php in git bash with UTF-8 support.

hakre
  • Nice addition to other answers. I had to use this workaround for PHP scripts even now - `git version 2.8.2.windows.1`. – Furgas May 12 '16 at 12:45

For me the solution was just to enable unicode support.
Docs: https://github.com/msysgit/msysgit/wiki/Git-for-Windows-Unicode-Support

git config --global core.quotepath off
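
For context: with core.quotepath at its default, git prints the non-ASCII bytes of a path as `\ooo` octal escapes inside double quotes, and those escapes are just the UTF-8 bytes of the name. As a rough sketch, a quoted name can be turned back into text like this. The function name `unquote_path` is made up for the example, and it only handles octal escapes, not the other escapes git may emit such as `\"` or `\t`.

```shell
#!/bin/sh
# Hypothetical helper: decode a git-quoted path consisting of plain
# characters and \ooo octal escapes.
unquote_path() {
    # Strip the surrounding double quotes and protect literal % signs,
    # then let printf expand the octal escapes in its format string.
    inner=$(printf '%s' "$1" | sed 's/^"//; s/"$//; s/%/%%/g')
    printf "$inner\n"
}

unquote_path '"\316\244\316\261.txt"'   # decodes to the bytes of Τα.txt
```

Turning quotepath off simply makes git print such names directly, so no decoding is needed.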

zezba9000

I found the following steps helpful:

  1. Run Git Bash
  2. Right-click and select Options...
  3. Select Text group at the left
  4. Change Font to Consolas
  5. Select C as Locale and UTF-8 as Character set
  6. Apply and Save.

Git Bash Options

  7. In the terminal, execute:

    git config --global core.quotepath false

  8. In rare cases, also execute in the terminal:

    export LANG='C.UTF-8'

Maxim Suslov

I use Git Bash Here for splitting files. After I split the files, I use Notepad++ (Import Python Scripts) and this very good code to change them from ANSI/UTF-8 to UTF-8-BOM:

# -*- coding: utf-8 -*-
from __future__ import print_function

from Npp import notepad
import os

utf8_bom = b'\xEF\xBB\xBF'  # UTF-8 byte order mark
# Ask for the folder to process; an empty or cancelled prompt skips everything.
top_level_dir = notepad.prompt('Paste path to top-level folder to process:', '', '')
if top_level_dir:
    if not os.path.isdir(top_level_dir):
        print('bad input for top-level folder')
    else:
        for (root, dirs, files) in os.walk(top_level_dir):
            for file in files:
                full_path = os.path.join(root, file)
                print(full_path)
                with open(full_path, 'rb') as f:
                    data = f.read()
                # Skip empty files; compare all three BOM bytes, not just the first.
                if len(data) > 0:
                    if not data.startswith(utf8_bom):
                        try:
                            with open(full_path, 'wb') as f:
                                f.write(utf8_bom + data)
                            print('added BOM:', full_path)
                        except IOError:
                            print("can't change - probably read-only?:", full_path)
                    else:
                        print('already has BOM:', full_path)


Just Me

The problem with chcp 65001 is that there are bugs in the C runtime (MSVCRT) that make stdio calls return inconsistent results when run under code page 65001.

That should be better with Git 2.23 (Q3 2019)

See commit 090d1e8 (03 Jul 2019) by Karsten Blees (kblees).
(Merged by Junio C Hamano -- gitster -- in commit 0328db0, 11 Jul 2019)

gettext: always use UTF-8 on native Windows

On native Windows, Git exclusively uses UTF-8 for console output (both with MinTTY and native Win32 Console).

Gettext uses setlocale() to determine the output encoding for translated text, however, MSVCRT's setlocale() does not support UTF-8.
As a result, translated text is encoded in system encoding (as per GetAPC()), and non-ASCII chars are mangled in console output.

Side note: There is actually a code page for UTF-8: 65001.

In practice, it does not work as expected at least on Windows 7, though, so we cannot use it in Git. Besides, if we overrode the code page, any process spawned from Git would inherit that code page (as opposed to the code page configured for the current user), which would quite possibly break e.g. diff or merge helpers.
So we really cannot override the code page.

In init_gettext_charset(), Git calls gettext's bind_textdomain_codeset() with the character set obtained via locale_charset(); Let's override that latter function to force the encoding to UTF-8 on native Windows.

In Git for Windows' SDK, there is a libcharset.h and therefore we define HAVE_LIBCHARSET_H in the MINGW-specific section in config.mak.uname, therefore we need to add the override before that conditionally-compiled code block.

Rather than simply defining locale_charset() to return the string "UTF-8", though, we are careful not to break LC_ALL=C: the ab/no-kwset patch series, for example, needs to have a way to prevent Git from expecting UTF-8-encoded input.

And:

See commit 697bdd2 (04 Jul 2019), and commit 9423885, commit 39a98e9 (27 Jun 2019) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit 0a2ff7c, 11 Jul 2019)

mingw: use Unicode functions explicitly

Many Win32 API functions actually exist in two variants: one with the A suffix that takes ANSI parameters (char * or const char *) and one with the W suffix that takes Unicode parameters (wchar_t * or const wchar_t *).

The ANSI variant assumes that the strings are encoded according to whatever is the current locale.
This is not what Git wants to use on Windows: we assume that char * variables point to strings encoded in UTF-8.

There is a pseudo UTF-8 locale on Windows, but it does not work as one might expect. In addition, if we overrode the user's locale, that would modify the behavior of programs spawned by Git (such as editors, difftools, etc), therefore we cannot use that pseudo locale.

Further, it is actually highly encouraged to use the Unicode versions instead of the ANSI versions, so let's do precisely that.

Note: when calling the Win32 API functions without any suffix, it depends whether the UNICODE constant is defined before the relevant headers are #include'd.
Without that constant, the ANSI variants are used.
Let's be explicit and avoid that ambiguity.

VonC