
I used a PHP script to create directories named after category names that I fetched from a site (UTF-8), but the directories it created show unreadable characters instead of the real words.

AFAIK, PHP on Windows works in the cp1251 locale and can't handle UTF-8 file or directory names.

So the question is: is it possible to use Python to walk over all of the directories and rename them so that the names display correctly?

It looks like the following piece of code works; now I only need to walk through the directories recursively and rename all of them.

import os

basedir = "C:\\Users\\alex\\Desktop\\1\\save"
dirs = os.listdir(basedir)
for fn in dirs:
    print fn
    # Python 2: fn is a byte string; decode its bytes as UTF-8 to get unicode
    nn = fn.decode('utf-8')
    os.rename(os.path.join(basedir, fn), os.path.join(basedir, nn))
Ivan Petrov
  • use Unicode literal: `os.path.expanduser(ur"~\Desktop\1\save")`, to get Unicode filenames. – jfs Oct 09 '15 at 07:35

2 Answers


A few things to clarify:

  • UTF-8 is an encoding, not a character set. The character set is called Unicode. 💩 is character 128169 (U+1F4A9) in that character set.

  • The string "💩.txt" contains 5 characters. You can encode these characters to bytes using an encoding like UTF-8 or UTF-16. Computers store bytes, so a program has to use one of these encodings to internally represent that string.

  • As a consequence there is no such thing as “renaming directories to the Unicode character set”. The file name 💩.txt is these 5 characters, regardless of how the operating system happens to store those characters on disk.
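The difference between characters and bytes can be checked directly. The following is a minimal sketch in Python 3 (an assumption; the code elsewhere in this thread is Python 2), writing character 128169 as the escape `\U0001F4A9` followed by ".txt":

```python
# Python 3: a str is a sequence of Unicode code points.
name = "\U0001F4A9.txt"            # character 128169 followed by ".txt"

n_chars = len(name)                # number of characters (code points)
code_point = ord(name[0])          # numeric value of the first character

utf8 = name.encode("utf-8")        # 4 bytes for U+1F4A9 + 4 ASCII bytes
utf16 = name.encode("utf-16-le")   # surrogate pair (4 bytes) + 4 * 2 bytes

print(n_chars, code_point, len(utf8), len(utf16))  # 5 128169 8 12

# Both byte strings decode back to the same 5 characters.
print(utf8.decode("utf-8") == utf16.decode("utf-16-le") == name)  # True
```

The byte counts differ between encodings, but the decoded name is the same string in both cases.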

The problem is PHP itself. On Windows, PHP internally encodes strings in the local ANSI code page. That code page probably can't encode the character 💩, so PHP is not able to internally represent this string. As a consequence you can never access the file 💩.txt in PHP. The only workaround is using a special module to access those files. See "How to open file in PHP that has unicode characters in its name?".
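Whether a given code page can represent a file name can be illustrated with Python's codecs. This is only a sketch in Python 3: the helper name `encodable` is hypothetical, and Python's `cp1251` and `cp1252` codecs stand in for the Windows ANSI code pages; it does not reproduce what PHP does internally.

```python
def encodable(name, code_page):
    """Return True if every character of name exists in code_page."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

# Cyrillic fits in cp1251 (Cyrillic) but not in cp1252 (Western European).
print(encodable("торт.txt", "cp1251"))        # True
print(encodable("торт.txt", "cp1252"))        # False

# Character 128169 (U+1F4A9) is in neither single-byte code page.
print(encodable("\U0001F4A9.txt", "cp1251"))  # False
```

A name that fails this check for the active code page has no byte-string form in that code page at all.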

roeland
  • Thanks for clarifying. Now I'm trying to understand code pages and encodings. I also see that to display the directory names correctly I need to convert them to windows-1251 encoding. I did some searching and wrote some code to rename my directories using Python (I'll update my post now), so the problem is half-solved. I'm still curious how the .decode('utf-8') method works, because I didn't specify the encoding to convert to. – Ivan Petrov Oct 09 '15 at 07:28
  • Yes, in PHP that's probably correct. You can use Python to fix the file names, as per Sebastian's answer, but note that these files will no longer be accessible from PHP. – roeland Oct 11 '15 at 20:58
  • @IvanPetrov: the files will be accessible from PHP (and by any other means). My code fixes mojibake such as `С‚РѕСЂС‚`, to get `торт` instead. – jfs Oct 15 '15 at 08:08
  • @J.F.Sebastian It is a known limitation of the Windows version of PHP. If your code page does not contain Cyrillic then there is no way in PHP to open a file called `торт.txt`. – roeland Oct 15 '15 at 22:02
  • @J.F.Sebastian Good point. And also, thanks for the edit—I'll keep that in mind for the next time. – roeland Oct 15 '15 at 22:12

If PHP saved your UTF-8 filenames as if they were cp1251, then you can recode them back:

>>> correct_filename = u"торт.txt"
>>> mojibake = correct_filename.encode('utf-8').decode('cp1251') # WRONG
>>> print(mojibake) # if you see this;
С‚РѕСЂС‚.txt
>>> print(mojibake.encode('cp1251').decode('utf-8')) # recode
торт.txt
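The same round trip can be wrapped in a helper that leaves a name alone when it is not mojibake of this kind. This is a sketch in Python 3 (an assumption; the session above is Python 2), and the name `repair_name` is hypothetical:

```python
def repair_name(name, wrong="cp1251", right="utf-8"):
    """Undo 'UTF-8 bytes read as cp1251'; return name unchanged on failure."""
    try:
        return name.encode(wrong).decode(right)
    except UnicodeError:
        # Not mojibake of this kind (for example, already correct Cyrillic).
        return name

# Build the mojibake the same way as the session above, then repair it.
mojibake = "торт.txt".encode("utf-8").decode("cp1251")
print(repair_name(mojibake) == "торт.txt")    # True

# An already correct name does not survive the round trip and is kept as-is.
print(repair_name("торт.txt") == "торт.txt")  # True

# Pure ASCII names are identical in both encodings.
print(repair_name("readme.txt"))              # readme.txt
```

Catching `UnicodeError` covers both the encode and the decode step, so the helper can be applied to every name in a directory without first checking which ones are damaged.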

Always use Unicode type for filenames on Windows.

To rename all .txt files in a given directory:

#!/usr/bin/env python2
import os
from glob import glob

dirpath = os.path.expanduser(ur"~\Desktop\1\save")
for mojibake_path in glob(os.path.join(dirpath, '*.txt')):
    path = mojibake_path.encode('cp1251').decode('utf-8')
    os.rename(mojibake_path, path)

Note: dirpath is a Unicode string.
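The question asks for a recursive walk that also renames directories, which the script above does not do. One possible sketch, assuming Python 3 (where `str` paths are already Unicode) and the same cp1251/UTF-8 mix-up; the function name `rename_tree` and its `dry_run` parameter are hypothetical:

```python
import os

def rename_tree(root, wrong="cp1251", right="utf-8", dry_run=True):
    """Recode mojibake names of all files and directories below root.

    Returns a list of (old_path, new_path) pairs. With dry_run=True
    nothing is renamed; the pairs only show what would happen.
    """
    planned = []
    # topdown=False visits the deepest directories first, so a directory
    # is renamed only after everything inside it has been handled.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            try:
                fixed = name.encode(wrong).decode(right)
            except UnicodeError:
                continue  # not mojibake of this kind; leave it alone
            if fixed == name:
                continue  # pure ASCII or otherwise unchanged
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, fixed)
            if os.path.exists(dst):
                continue  # never overwrite an existing entry
            planned.append((src, dst))
            if not dry_run:
                os.rename(src, dst)
    return planned
```

Running it once with the default `dry_run=True` and inspecting the returned pairs before calling it again with `dry_run=False` is a cautious way to use it, since a rename is awkward to undo in bulk.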

jfs
  • Why do I have to use the Unicode type for filenames when Windows uses windows-1251? – Ivan Petrov Oct 09 '15 at 07:48
  • @IvanPetrov: you should use Unicode type if you work with text in Python. Filenames are text on Windows (it provides Unicode API for them). It doesn't matter in this case (all filenames by (invalid) construction have characters in cp1251 range) but in general Unicode supports around a million characters while cp1251 supports only 256. – jfs Oct 09 '15 at 07:53
  • So, finally, what encoding does Windows use for filenames? Is it Unicode (UTF-16) or windows-1251? – Ivan Petrov Oct 09 '15 at 07:59
  • @IvanPetrov: Unicode means `type(text) == unicode` here. It has nothing to do with utf-16. All you need to know is that if you pass Unicode filename; you get back a Unicode filename on Windows. **`unicode` is an abstraction** -- Python may use different internal representation on different python versions/builds -- you shouldn't care how it is represented internally in python or Windows (except for some edge cases on narrow python builds in Python 2). – jfs Oct 09 '15 at 08:04