
Ok, this is what my company is currently using for source code control:

TortoiseHg version 2.10 with Mercurial-2.8, Python-2.7.3, PyQt-4.10.2, Qt-4.8.4

Our parent company is a Japanese company and they share source code with us. They get lazy and sloppy and put all kinds of files in subdirectories containing Japanese characters. It's been a royal pain in the butt to have to go through all the subdirectories every time we get source code for new projects and manually deal with them. I've been reading a lot of info on how this can be automated, but I still don't see a relatively easy way. We don't know which Japanese characters we'll encounter so it somehow has to be dynamic. We can't hard-code every possible Japanese character.

Mercurial doesn't handle non-ASCII characters, right? We're trying to create new Mercurial repositories and have something convert these Japanese characters, either with a batch or script file before adding the files to the repository, or by invoking some Mercurial option that will handle them.

Anyone have a solution for this?

  • http://stackoverflow.com/questions/12540247/unicode-filenames-on-windows-mercurial-2-5-or-future has some info and links. – Mike C Mar 07 '14 at 23:04

1 Answer


Mercurial says filenames are "explicitly treated as binary data in an unknown encoding" and recommends against converting to/from Unicode for file names or contents. It doesn't transcode non-ASCII filenames, but I don't know if that means it doesn't accept them.

You don't need to hard-code any characters. Japanese, Chinese, and Korean characters all fall within a handful of well-defined Unicode ranges: http://docs.oracle.com/cd/B19306_01/server.102/b14225/appunicode.htm

Hiragana: Range: 3040–309F http://www.unicode.org/charts/PDF/U3040.pdf

Katakana: Range: 30A0–30FF http://www.unicode.org/charts/PDF/U30A0.pdf

CJK Kanji: Range: 4E00–9FCC (the block itself runs to 9FFF) http://www.unicode.org/charts/PDF/U4E00.pdf

So for example in Python 2.x the range u'[\u3040-\u309F]' should encompass all Hiragana characters.

You could grep the directory and file names for any Unicode characters in those ranges. Or rather, for anything that's not within the ASCII/European (Latin-1) range, e.g. u'[^\u0020-\u00FF]', and deal with the Japanese characters you come across.
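As a rough sketch of that search, the ranges above can be combined into one regular expression and run over a directory tree. This is written for Python 3 (an assumption; the same patterns work with the u'' prefix in Python 2.x). The names JAPANESE_RE, NON_ASCII_RE and find_non_ascii_paths are made up for illustration, and the second pattern is stricter than "ASCII/European": it flags anything outside printable ASCII.

```python
import os
import re

# Hiragana, Katakana, and CJK ideographs (kanji), per the ranges listed above.
# Assumption: the kanji range ends at 9FFF (end of the block) rather than 9FCC,
# so code points assigned later in the same block are matched too.
JAPANESE_RE = re.compile(u'[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]')

# Anything outside printable ASCII (space through tilde).
NON_ASCII_RE = re.compile(u'[^\u0020-\u007E]')


def find_non_ascii_paths(root):
    """Yield (path, has_japanese) for each entry under root whose own
    name contains at least one character outside printable ASCII."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if NON_ASCII_RE.search(name):
                full_path = os.path.join(dirpath, name)
                yield full_path, bool(JAPANESE_RE.search(name))
```

Calling list(find_non_ascii_paths('.')) from the top of an incoming source drop would list every file or directory that needs attention, without hard-coding any individual character.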

The question then is what do you want to convert the characters TO?

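A romaji transliteration of the Japanese? Or just some placeholder? If you care what the name says, you could take the Unicode text and run it through a transliteration library such as http://unidecode.codeplex.com/ (a Python package named Unidecode also exists), then put that transliteration back in the subdirectory name. Be aware that Unidecode-style tables are context-free, so kanji may come out with Chinese rather than Japanese readings.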

Or just replace it w/some kind of placeholder if you don't care what the character is.
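A minimal sketch of the placeholder approach, assuming Python 3 and only the standard library. The function names and the _uXXXX placeholder format are illustrative choices, not anything Mercurial or TortoiseHg prescribes. It walks the tree bottom-up so that a directory is renamed only after its contents have been handled, and it defaults to a dry run.

```python
import os


def ascii_safe_name(name):
    """Replace each non-ASCII character with a placeholder such as _u3042
    (the character's code point in hex)."""
    return ''.join(
        ch if ord(ch) < 128 else '_u{:04X}'.format(ord(ch))
        for ch in name
    )


def rename_tree(root, dry_run=True):
    """Rename non-ASCII files and directories under root, deepest first.

    Returns a list of (old_path, new_path) pairs. Nothing on disk is
    changed while dry_run is True. The root directory itself is not renamed.
    """
    renames = []
    # topdown=False yields a directory only after its subdirectories,
    # so paths inside a directory are still valid when they are renamed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            safe = ascii_safe_name(name)
            if safe == name:
                continue
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, safe)
            if os.path.exists(new):
                raise RuntimeError('target already exists: ' + new)
            renames.append((old, new))
            if not dry_run:
                os.rename(old, new)
    return renames
```

Running rename_tree(path) first and reviewing the returned pairs, then rerunning with dry_run=False before adding the files to the repository, keeps the step repeatable for each new source drop. Because the placeholder encodes the code point, the original name can be recovered later if needed.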

According to "How do I check if a string is unicode or ascii?", a Python 2.x string can be either str or unicode, so you could check the type as well. Depending on how the names were obtained, it may be that only the offending directory names come back as unicode.

mc01