23

I'm interested in trying out distributed version control systems. Git sounds promising, but I saw a note somewhere for the Windows port of Git that said "don't use non-ASCII filenames". I can't find that note now, but there is this link. It has put me off Git for now, but I don't know whether the other options are any better.

Support for non-ASCII filenames is essential for my Japanese company. I'm looking for a DVCS that stores filenames internally as Unicode, not in a platform-dependent encoding, which would cause endless grief. So:

  1. Which DVCSs support Unicode filenames?
  2. In both Windows and Linux?
  3. Ideally, with the possibility to transfer repositories between Windows and Linux machines with minimal issues?
Craig McQueen
  • UTF-8 support for msysgit is coming. See http://stackoverflow.com/a/5855213/6309 and the updated answer below: http://stackoverflow.com/a/1274142/6309 – VonC Feb 07 '12 at 09:31

7 Answers

9

See issue 80 in the same repository. In 2009, there was a discussion on the Git Mailing list (e.g. 1, 2) where the Git maintainer Junio Hamano asked some questions regarding this. I don't have it right here. By joining the thread in a constructive manner you might help in resolving the issue.

In the Java implementation JGit, we always use UTF-8 when we create textual metadata and filenames. That is the only way, but there are some things to consider.

sleske
robinr
8

git

August 2009:

The msysgit project is busy fixing UTF-8 support for Git on Windows. It might be fixed in the next release.


Update February 2012

UTF-8 support is coming to msysgit, with commits like this one: "Update less settings for UTF-8".

From the Git for Windows Google+ page:

Karsten Blees' UTF-8 patches for Git for Windows has now been merged to 'devel'.
This means the upcoming release will support Unicode filenames!


Update April 2012

It's now released in msysgit 1.7.10.

See the page Git for Windows Unicode Support.

Community
8

Bazaar works with Unicode filenames internally, and it has very good Unicode support on both Linux and Windows.

bialix
  • There is a page on their site about Bazaar's Unicode support: http://bazaar-vcs.org/UnicodeSupport – Austin May 07 '09 at 01:40
  • That page is more of a developer spec than user documentation, and it's a bit out of date. – bialix May 07 '09 at 09:33
  • 2
    I did some basic tests of Bazaar on Windows, and confirmed it could add and merge files even if they had filename characters outside the current system code page. Good stuff. I'll try the repository on a Linux box later and see if it can branch it correctly. – Craig McQueen May 12 '09 at 00:49
  • I did some further tests of Bazaar on Windows, and discovered that while the command line works fine, the GUI fails at committing changes to a file with filename characters outside the current system code page. – Craig McQueen May 12 '09 at 04:14
  • Craig, thank you for the comment. This is actually a problem with all Python-based programs. I've filed a bug about Unicode characters outside the current system code page on the command line: https://bugs.launchpad.net/bzr/+bug/375934. It will be fixed shortly. – bialix May 13 '09 at 10:29
  • So I discovered about Python and the Windows command line. See my question about it: http://stackoverflow.com/questions/846850/how-to-read-unicode-characters-from-command-line-arguments-in-python-on-windows – Craig McQueen May 20 '09 at 02:23
  • Thanks, Craig. I've already implemented similar solution for bzr. It will be released as bzr 1.16 or later, IIUC. – bialix May 20 '09 at 10:58
8

Mercurial

On Linux, I think Mercurial just stores filenames in whatever encoding the system uses (correct me if I'm wrong). So it is best to set Linux up for UTF-8 for cross-platform compatibility. This is the default for many modern distributions.

On Windows, Mercurial (due to Python's byte string handling) uses the system code page. This just about guarantees bad cross-platform interoperation for non-ASCII characters.
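To see why storing filenames in the system code page interoperates badly, here is a minimal Python 3 sketch. It does not reproduce Mercurial's internals (Mercurial ran on Python 2 at the time); the filename, the choice of cp932 as the Japanese Windows code page, and the helper names are all assumptions made for illustration.

```python
# Hypothetical filename, chosen only for illustration.
name = "日本語.txt"

# The same name as raw bytes under two encodings.
utf8_bytes = name.encode("utf-8")    # what a UTF-8 Linux system would store
cp932_bytes = name.encode("cp932")   # what a Japanese Windows code page would store


def decode_or_replace(raw: bytes, encoding: str) -> str:
    """Decode raw filename bytes, marking undecodable bytes with U+FFFD."""
    return raw.decode(encoding, errors="replace")


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The byte sequences differ, so a repository written on one system
# is misread on the other unless something translates.
print(utf8_bytes != cp932_bytes)
print(decode_or_replace(cp932_bytes, "utf-8") == name)

# Characters outside the code page cannot be stored at all,
# e.g. a Korean syllable under a Japanese code page.
print(can_encode("한", "cp932"))
```

The last check is the "characters outside the current code page" case that comes up repeatedly in this thread: no byte-level convention can help when the code page simply has no slot for the character.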

fixutf8 Extension for Windows (prior to Mercurial 2.0)

There is an externally created Mercurial extension for Windows called fixutf8, which properly handles all Unicode characters (even those outside the current code page) and encodes the filenames as UTF-8 in the Mercurial repository. It thus enables interoperation with Linux as long as Linux is using the UTF-8 encoding. I tried enabling it on my Windows set-up last week, and had a couple of problems with installation. Since then, one problem has been fixed. Now the only issue is that the binary Mercurial distributions are built with Python 2.4, while fixutf8 can only be loaded if Mercurial is built with Python 2.5 or higher. I expect this will be resolved in the near future.

Mercurial 2.0 and later for Windows

fixutf8 seems to be incompatible with Mercurial 2.0 and later, according to the fixutf8 web page. See WindowsUTF8Plan for details on future solutions. I'm not sure when this is expected to be implemented.

CharlesB
Craig McQueen
  • When I've looked at the Mercurial code I did not find any unicode support for filenames. – bialix May 11 '09 at 19:49
  • 4
    I maintain the fixutf8 extension and use it daily with a binary build of HG. File a bug http://bitbucket.org/stefanrusek/hg-fixutf8/ and I will gladly take a look. – Stefan Rusek May 13 '09 at 05:36
  • Thanks Stefan. I've heavily edited this answer now that I've successfully installed fixutf8 and found that it works well. I was held back by a bug that you've fixed in the last few days. – Craig McQueen May 13 '09 at 10:52
  • I've had a [problem with the `fixutf8` extension](http://bitbucket.org/stefanrusek/hg-fixutf8/issue/24/error-using-rename-command) recently. That problem seems to be fixed by a [fork of `fixutf8`](http://bitbucket.org/tinyfish/hg-fixutf8). – Craig McQueen Aug 11 '10 at 05:31
  • 1
    fixutf8 does not work with the latest versions of mercurial (e.g. 2.5) – Nathan Sep 22 '12 at 02:04
  • 4
    -1 because this no longer works. As of Dec 2012, Mercurial is **NOT** a Unicode-supporting DVCS, and it will likely have bad support for years to come because, for some strange reason, they decided to treat filenames as "binary blob", as opposed to "text" (for the record, it's because Unix also treats filenames as binary blobs rather than text). – Roman Starkov Dec 03 '12 at 00:14
  • Thanks for letting me know. But maybe Mercurial is aiming to support it natively. See [WindowsUTF8Plan](http://mercurial.selenic.com/wiki/WindowsUTF8Plan). That sounds similar to the way git handles it (works on Linux as long as filesystem is set for UTF8; translate on Windows). – Craig McQueen Dec 03 '12 at 03:06
2

Git on Windows 1.7.10 now uses UTF-8 for filenames regardless of the user's locale.

sleske
robinr
0

This is a really tricky problem. The problems arise either because tools try to interpret filenames when they don't know the encoding for sure, or because they translate filenames to a form that cannot handle all cases (e.g. ASCII or UTF-16). None of the three main operating systems agree on how a filename is encoded either, which makes things even harder.
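As a rough illustration of both failure modes, here is a small Python 3 sketch. The filename is hypothetical. The normalization part assumes a platform that stores names in decomposed form (as far as I know, macOS on HFS+ does something close to NFD), which is one concrete way the platforms disagree even when everything is nominally Unicode.

```python
import unicodedata

# Hypothetical filename containing a character with a canonical decomposition.
name = "が.txt"

# The same visible name in composed (NFC) and decomposed (NFD) form.
nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

# The two forms look identical on screen but are different code point
# sequences, so a naive comparison treats them as two different files.
print(nfc == nfd)
print(len(nfc), len(nfd))

# Normalizing both sides before comparing makes them match again.
print(unicodedata.normalize("NFC", nfd) == nfc)

# Translating to a form that cannot represent the name loses information:
# the non-ASCII character is replaced and cannot be recovered.
lossy = nfc.encode("ascii", errors="replace")
print(lossy)
```

A tool that compares raw bytes sees the NFC and NFD names as unrelated, and a tool that translates to ASCII cannot get the original name back.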

For a good understanding of the issues I suggest reading Mercurial's encoding strategy page. It describes how the various platforms vary, and why Mercurial has chosen the strategy it has.

If you really need to do this, then the most basic requirement is that ALL systems be set up to use UTF-8 filenames, and not one of the many Japanese code pages. This is easier said than done, but once it is done, no system should need to translate the filenames into anything else.

No translation, no issues.
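One quick way to check what a given machine is configured for is to ask Python, as in the sketch below. The function names are made up for this example, and the result only shows how the Python runtime on that machine maps filenames to bytes, not a property of the filesystem itself (recent Python versions on Windows report UTF-8 here regardless of the system code page).

```python
import codecs
import locale
import sys


def filename_encoding_report() -> dict:
    """Collect the encodings this Python runtime uses for filenames and text."""
    return {
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_locale_encoding": locale.getpreferredencoding(False),
    }


def uses_utf8(encoding_name: str) -> bool:
    """Return True if the name is an alias of UTF-8 (e.g. 'UTF8', 'utf_8')."""
    return codecs.lookup(encoding_name).name == "utf-8"


report = filename_encoding_report()
for label, encoding in report.items():
    status = "UTF-8" if uses_utf8(encoding) else "NOT UTF-8"
    print(f"{label}: {encoding} ({status})")
```

Running this on every machine that will touch the repository gives a first indication of where translation would be needed.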


*: Yes, I know you can have a default system encoding, but this is not the same as a filesystem encoding. What happens when a filesystem is accessed by multiple systems or it is physically moved between systems?

Paul S
0

According to this page: Bazaar, Codendi, CVSNT, Monotone, Perforce, Rational Team Concert, Subversion, Surround SCM, Synergy. But there are lots of 'Unknowns' on that page.

Benjol