
I have several files encoded in UTF-16LE and I want to convert them to ANSI. I found some suggestions on Stack Overflow (Convert from ANSI to UTF-8), but they don't work: the files are converted, but there are spaces between the words and the numbers, and these characters are left over from the conversion: ÿ þ

import glob
import codecs

for each in glob.glob('path/**/*.txt', recursive=True):

    # read input file
    with codecs.open(each, 'r', encoding='mbcs') as file:
        lines = file.read()

    # write output file
    with codecs.open(each, 'w', encoding='UTF-16LE') as file:
        file.write(lines)

What am I missing? Thanks

Alex
  • If your code is reading a file encoded as UTF-16, you should put `"UTF-16"`, and not `"MBCS"`, in the `open()` call. When you write the file, open the file for output with `encoding="cp1252"`. ANSI is a misnomer and is essentially meaningless because no "ANSI codepage" was ever standardized. Python treats it as a synonym for MBCS, which is also pretty useless because it stands for multibyte character set, which is a class of encodings. And don't call `codecs.open()`. Just call the builtin function `open()`. That is the recommended approach for text files. – BoarGules Nov 12 '21 at 15:39
  • Thank you for your advice and the explanation. I tried substituting and inverting. I got this error: UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined> – Alex Nov 12 '21 at 15:58
  • FEFF is a byte order mark. Maybe put `UTF-16LE` instead of `UTF-16`. – BoarGules Nov 12 '21 at 16:05
  • OK, I added `errors='ignore'` after the encoding and it works. But sometimes it saves some files as UTF-8 and not as ANSI. I don't know why – Alex Nov 12 '21 at 16:38
  • You need to be clear about what you mean by "ANSI". What character do you see in the output that you reckon is UTF-8 rather than "ANSI", which means whatever you want it to mean. – BoarGules Nov 12 '21 at 16:50
  • Your suggestion about `encoding="cp1252"` was very useful and it works. Simply, after I run the program, I open certain *.txt files and bottom right I read UTF-8 instead of ANSI. Sometimes it deletes the text in some *.txt – Alex Nov 12 '21 at 16:56
  • Your title says UTF16-LE to ANSI but your code is opening as ANSI and saving as UTF16. Please clarify. – Mark Tolonen Nov 12 '21 at 17:09
  • At the beginning the program could save to ANSI file format in this way, I don't know why. Now following @BoarGules's instructions I have done the substitutions. Now I think I have a problem related to the BOM marks [Python - Decode UTF-16 file with BOM](https://stackoverflow.com/questions/22459020/python-decode-utf-16-file-with-bom) – Alex Nov 12 '21 at 17:29
  • Using `'mbcs'` as the encoding means "encode in the current default ANSI code page for Windows (CP_ACP)". For US/Western European Windows, that will be Windows-1252, a.k.a. `cp1252`, but for other versions of Windows it will vary. If you *want* the default code page, it is valid to use, but if you want a *specific* code page, use that instead. – Mark Tolonen Nov 12 '21 at 17:31
  • Use `utf-16` if you have a BOM when reading or want one on writing, use `UTF-16LE` if you don't. – Mark Tolonen Nov 12 '21 at 17:32
  • Also note that when converting from UTF-16 to MBCS, not all code points are supported by ANSI code pages, so you could get errors. You can use `errors='replace'` or `errors='ignore'` to handle those conditions. – Mark Tolonen Nov 12 '21 at 17:35
  • If you have text that disappears when re-encoded as Windows-1252 and you have `errors='ignore'`, that is because there is a character in the original that *cannot be encoded* in Windows-1252 because the encoding does not contain that character. Example: `ǫ` `ǣ`. – BoarGules Nov 12 '21 at 17:36
  • @MarkTolonen thanks. Following your consideration, I verified it wasn't related to BOM indeed. I had problems using `mbcs` but now with `cp1252` it seems solved, also using `errors='ignore'`. @BoarGules yes, I understand. Due to this fact, my issue is that it leaves (saves) certain files in UTF-8 format... – Alex Nov 12 '21 at 17:50
  • If you write to a file opened with `encoding="cp1252"` it is *not* going to change to UTF-8 because of what is in the data. What makes you think you are getting UTF-8? If you have, say, `ë` in your original file (UTF-16: `00EB`) and you write it to a file with `encoding="cp1252"` you will see `ë` (Windows-1252: `EB`) in your output. But if instead you write to the file with `encoding="UTF-8"`, you will get `C3AB` in the output file, and if you then display the file in an editor like Notepad that assumes a Windows encoding, it will show `ë`. – BoarGules Nov 12 '21 at 22:22
  • I have files in UTF-16LE file format. When I run the program, some UTF-16LE files change to ANSI and the others change to UTF-8 format. What makes me think that I'm getting UTF-8? When I open the files with Notepad, at the bottom right corner of the window I read UTF-8 instead of ANSI. – Alex Nov 13 '21 at 00:27
  • You are telling us about what Notepad (which is not trustworthy on this subject, use Notepad++ instead) says about the file, but you give us no actual indication of what you think is wrong. Please give a few examples along the lines of *my file has `XXXX` in UTF-16 and when I write it to a file with encoding cp1252 I expect to get `YY` but instead I get `ZZZZ` or `ZZZZZZ` which is the UTF-8 representation of UTF-16 `XXXX`*. – BoarGules Nov 13 '21 at 23:48
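The two failure modes discussed in these comments (a stray BOM, and characters that cp1252 simply lacks) can be checked in memory without touching any files. This is an illustrative sketch; `ǫ` is one of the characters BoarGules mentions above:

```python
# b'\xff\xfe' is the little-endian BOM. Decoding with 'utf-16le' keeps it
# as U+FEFF, which later triggers the UnicodeEncodeError quoted above.
assert b'\xff\xfeA\x00'.decode('utf-16le') == '\ufeffA'

# Decoding the same bytes with 'utf-16' consumes the BOM instead.
assert b'\xff\xfeA\x00'.decode('utf-16') == 'A'

# cp1252 has no 'ǫ', so a strict encode raises UnicodeEncodeError...
try:
    'ǫ'.encode('cp1252')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# ...while errors='replace' substitutes '?' and errors='ignore' drops the
# character entirely, which is how text can silently vanish from the output.
assert 'ǫ'.encode('cp1252', errors='replace') == b'?'
assert 'ǫ'.encode('cp1252', errors='ignore') == b''
```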

0 Answers