44

I'm trying to write a Python script to convert UTF-8 files into ASCII files:

#!/usr/bin/env python
# *-* coding: iso-8859-1 *-*

import sys
import os

filePath = "test.lrc"
fichier = open(filePath, "rb")
contentOfFile = fichier.read()
fichier.close()

fichierTemp = open("tempASCII", "w")
fichierTemp.write(contentOfFile.encode("ASCII", 'ignore'))
fichierTemp.close()

When I run this script, I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 13: ordinal not in range(128)

I thought the 'ignore' argument to the encode method would suppress such errors, but apparently it does not.

I'm open to other ways of doing the conversion.

Nicolas
  • 2
    The problem is that you never decode in the first place. – Ignacio Vazquez-Abrams Nov 28 '10 at 23:23
  • You got the error because the character doesn't exist in the ASCII character set, so it can't be converted. Sometimes you can map the UTF8 character to a closest visual-fit character in ASCII, such as `é` to `e`, but that can change the meaning of words. You have to decide if that path will work for your application. – the Tin Man Nov 28 '10 at 23:24
  • This seems like a really bad idea!! – tchrist Nov 28 '10 at 23:55
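
To illustrate the closest-visual-fit mapping mentioned in the comments above, here is a minimal sketch using the standard library's `unicodedata` module. The helper name `to_ascii_closest` is made up for this example, and the approach only helps with characters that decompose into an ASCII base letter plus combining marks; anything else is still dropped.

```python
import unicodedata


def to_ascii_closest(text):
    """Approximate non-ASCII characters with their closest ASCII base letters."""
    # NFKD splits a character such as U+00E9 (e with acute accent)
    # into a plain 'e' followed by a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" then drops the combining marks and
    # any other character that has no ASCII equivalent.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

For example, `to_ascii_closest("V\u00e1zquez-Abrams")` gives `"Vazquez-Abrams"` instead of losing the accented letter entirely. As the comment warns, this can still change the meaning of words, so it is a judgment call for each application.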

3 Answers

69
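# Python 2: data is a byte string holding UTF-8 encoded text
data = "UTF-8 DATA"
udata = data.decode("utf-8")                 # bytes -> unicode
asciidata = udata.encode("ascii", "ignore")  # unicode -> ASCII bytes, dropping the rest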
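
The snippet above is Python 2, where a plain string literal is a byte string. In Python 3, `str` has no `decode` method, so the same decode-then-encode idea has to start from `bytes`. A minimal sketch, assuming the input is available as bytes (the sample value here is made up for illustration):

```python
# Python 3: start from bytes, e.g. what open(path, "rb").read() returns.
# The sample below is a stand-in for real file content.
data = "caf\u00e9 UTF-8 DATA".encode("utf-8")

udata = data.decode("utf-8")                 # bytes -> str
asciidata = udata.encode("ascii", "ignore")  # str -> ASCII bytes, non-ASCII dropped
```

Here `asciidata` ends up as `b"caf UTF-8 DATA"`: the accented character is silently removed.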
Utku Zihnioglu
  • 18
    Sounds like a bad recipe for data loss. – tchrist Nov 28 '10 at 23:55
  • 53
    You should expect data loss if you wish to convert from an 8-bit encoding to a 7-bit one. – Utku Zihnioglu Nov 29 '10 at 00:01
  • 3
    I didn't know that I had to decode first. It works now, thanks. To answer the questions: I want to do this because my MP3 player can only display lyrics files encoded in ASCII. – Nicolas Nov 29 '10 at 21:33
  • You can have a look at this solution: http://stackoverflow.com/a/517974/1463812 – JSBach Mar 27 '17 at 05:08
  • I get `AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?` for the second line with Python 3.10.4 – peer Jun 08 '23 at 22:34
9
import codecs

 ...

fichier = codecs.open(filePath, "r", encoding="utf-8")

 ...

fichierTemp = codecs.open("tempASCII", "w", encoding="ascii", errors="ignore")
fichierTemp.write(contentOfFile)

 ...
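
In Python 3 the built-in `open` accepts `encoding` and `errors` directly, so the same approach can be written without `codecs`. The following is only a sketch and not part of the original answer; the function name `convert_to_ascii` and its parameters are hypothetical.

```python
def convert_to_ascii(src_path, dst_path):
    """Copy a UTF-8 text file to dst_path, dropping non-ASCII characters."""
    # Decode while reading: the file's bytes are interpreted as UTF-8.
    with open(src_path, "r", encoding="utf-8") as src:
        content = src.read()
    # Encode while writing: characters outside ASCII are skipped, not raised.
    with open(dst_path, "w", encoding="ascii", errors="ignore") as dst:
        dst.write(content)
```

With the file names from the question, this would be called as `convert_to_ascii("test.lrc", "tempASCII")`.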
Ignacio Vazquez-Abrams
6

UTF-8 is a superset of ASCII. Either your UTF-8 file is ASCII, or it can't be converted without loss.
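
One way to find out which case applies is to try decoding the raw bytes as ASCII. A minimal sketch in Python 3; the helper name `is_pure_ascii` is made up for this example:

```python
def is_pure_ascii(data):
    """Return True if the bytes decode as ASCII, i.e. every byte is below 0x80."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True
```

If this returns `True` for the file's bytes, the file is already valid ASCII and needs no conversion; otherwise any conversion to ASCII has to drop or replace something.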

Tobu
  • 15
    I think he's aware of that, otherwise he wouldn't be trying to use `'ignore'`. – Ignacio Vazquez-Abrams Nov 28 '10 at 23:28
  • 1
    @Ignacio True. But this one left me wondering what the asker is trying to achieve. They could be cargo-culting, or maybe their need is best met by something like urlencode, or being lossy is just acceptable. – Tobu Nov 28 '10 at 23:43
  • I am afraid of the cargo-culting. Culling all characters that you don’t have an appreciation for is really insensitive. – tchrist Nov 28 '10 at 23:58
  • @Ignacio: Imagine being addressed as *Vzquez-Abrams*. :( – tchrist Nov 29 '10 at 00:11
  • @tchrist: That's why I never use it. – Ignacio Vazquez-Abrams Nov 29 '10 at 00:13
  • Sometimes you can convert UTF-8 to ASCII without real loss: for instance, typographic single quotes or apostrophes and, in a few other cases, arithmetic operators exist both as multi-byte UTF-8 characters and as single ASCII symbols. – Kovalex Nov 29 '21 at 06:49
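
A rough sketch of that kind of substitution in Python 3, using `str.translate`. The table below is hypothetical and deliberately tiny; which characters to map, and to what, is an assumption that depends on the files being converted.

```python
# Hypothetical mapping; extend it with whatever characters matter for your files.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / typographic apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2212": "-",   # minus sign
})


def replace_punctuation(text):
    """Swap typographic punctuation for its plain ASCII counterpart."""
    return text.translate(PUNCTUATION_MAP)
```

Running this before `encode("ascii", "ignore")` keeps those characters readable instead of dropping them.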