
Terminal: screen in xterm on the latest Ubuntu LiveCD.

��� �������.avi

When I run ls on the directory, I see this: a

ls -la gives me this: b

Midnight Commander shows me this: c

$ ls
??? ???????.avi

$ env | grep -i LANG
LANG=en_US.UTF-8

$ export | grep -i LANG
declare -x LANG="en_US.UTF-8"

This looks like a UTF-16 surrogate, am I right?

en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Surrogates

I tried to work around it with python3, but got this exception:

import os

for i in os.listdir('.'):
    print(i)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc4' in position 0: surrogates not allowed

I've uploaded a file with an empty body, just the title (4.0K): https://mega.co.nz/#!roYUyQaB!AwOMDznj9DC_wSpAeWqjVj_Oqu2z8Kfk5VsSmFs0ybA

Watch0nMe
  • It is possible for the file system to contain byte sequences which are not valid UTF-8, that's right. It's an error to use that in a file name on a file system set up for UTF-8 file names. What is unclear? If you need a workaround, remount the file system with a suitable encoding for such file names (the sequence would be valid, albeit gibberish, in many legacy 8-bit encodings, for example). – tripleee Aug 07 '14 at 09:21
  • @tripleee, see my edit. Could you download this file and identify the encoding for me? What is it? Is it a surrogate or something else? – Watch0nMe Aug 07 '14 at 09:45
  • Why does it matter? Which particular problem are you trying to solve? – tripleee Aug 07 '14 at 09:47
  • I'm trying to configure my terminal so that I don't run into unrecognizable file names in the future. – Watch0nMe Aug 07 '14 at 09:47
  • Most real-life Unicode encodings will have the possibility of invalid code sequences in file names. Just rename the problematic files. – tripleee Aug 07 '14 at 09:49
  • I've been doing this for a long time. When you are on a large p2p exchange network, such files appear every few days or weeks. Renaming them every time doesn't feel like the right way (the Unix way) to me. – Watch0nMe Aug 07 '14 at 09:58
  • @tripleee, your reputation is 28,942. Do you know of any universal way to decode anything by default in the terminal? I'm just tired of running into such file names. – Watch0nMe Aug 07 '14 at 10:36
  • Does this answer your question? [What is character encoding and why should I bother with it](https://stackoverflow.com/questions/10611455/what-is-character-encoding-and-why-should-i-bother-with-it) – tripleee Jan 17 '22 at 12:09

1 Answer

$ echo $'\xc4\xf3\xf5 \xe2\xf0\xe5\xec\xe5\xed\xed' | chardet
<stdin>: MacCyrillic (confidence: 0.92)
$ echo $'\xc4\xf3\xf5 \xe2\xf0\xe5\xec\xe5\xed\xed' | enca -L ru
MS-Windows code page 1251
  LF line terminators
$ echo $'\xc4\xf3\xf5 \xe2\xf0\xe5\xec\xe5\xed\xe8' | iconv -f 'Windows-1251'
Дух времени

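So the file name is encoded in Windows-1251, not UTF-8. enca identifies it correctly; chardet's MacCyrillic guess is wrong, since only the Windows-1251 decoding yields readable Russian. To display such names correctly, you need to set your terminal to Windows-1251.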
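As for the '\udcc4' in the Python traceback: it is not a UTF-16 surrogate stored on disk. On a UTF-8 locale, Python 3 decodes file names with the surrogateescape error handler, so each undecodable byte 0xNN becomes the lone code point U+DCNN, and print() then refuses to encode it. A minimal sketch of reversing that mapping and decoding the name as Windows-1251 follows. It assumes the file system encoding is UTF-8 (as LANG=en_US.UTF-8 suggests) and that the legacy encoding is cp1251; recover_name is a hypothetical helper name, not a standard function.

```python
def recover_name(name: str, legacy_encoding: str = "cp1251") -> str:
    """Re-decode a file name that came back with lone surrogates.

    Hypothetical helper: assumes names were decoded as UTF-8 with
    surrogateescape, and that bad names are really in legacy_encoding.
    """
    # Turn the surrogate-escaped str back into the original raw bytes.
    raw = name.encode("utf-8", "surrogateescape")
    try:
        # A valid UTF-8 name round-trips unchanged.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise fall back to the assumed legacy encoding.
        return raw.decode(legacy_encoding)


# The bytes from the commands above, plus the ".avi" extension.
raw_bytes = b"\xc4\xf3\xf5 \xe2\xf0\xe5\xec\xe5\xed\xe8.avi"

# This is what os.listdir() would hand back for such a name on a UTF-8 locale.
escaped = raw_bytes.decode("utf-8", "surrogateescape")
first_char = ascii(escaped[0])  # the lone surrogate from the traceback

recovered = recover_name(escaped)  # the readable Russian file name
```

With real directory entries, the same helper could be applied to each name returned by os.listdir('.'). The fallback is only a guess: a name in some other 8-bit encoding would still decode without error, just to the wrong characters.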

Karol S