2

I need to replace non-ASCII characters like ¾ in Python, but I get:

SyntaxError: Non-ASCII character '\xc2' in file test.py but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

After following the directions on the webpage, I am getting

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 449: ordinal not in range(128)

Here's my code:

data = data.replace(u"½", u"1/2")
data = re.sub(u"¾", u"3/4", data, flags=re.DOTALL)

What do I need to change in my code?


My file is:

#!/usr/bin/python

with codecs.open("file.txt", "r", "utf8") as myfile:
    data = myfile.read()

data = data.replace(u"½", u"1/2")

file.txt is:

hello world ½
wim
Erik

3 Answers

0

You're reading the file into the local variable data as bytes, but then treating data as if it were already a unicode object.

Change this:

with open(file_name, "r") as myfile:
    data = myfile.read()

To this:

import io

with io.open(file_name, encoding="utf8") as myfile:
    data = myfile.read()
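
As a minimal sketch of the whole round trip, the following writes a small UTF-8 file, reads it back as text with io.open, and replaces the fraction characters. The temporary file, the FRACTIONS table, and the two helper functions are made up for illustration and are not part of the question; the sketch also assumes the input really is UTF-8 encoded.

```python
import io
import os
import tempfile

# Hypothetical replacement table. The \u escapes keep this source pure ASCII,
# so no encoding declaration is needed to run it.
FRACTIONS = {
    u"\u00bc": u"1/4",  # VULGAR FRACTION ONE QUARTER
    u"\u00bd": u"1/2",  # VULGAR FRACTION ONE HALF
    u"\u00be": u"3/4",  # VULGAR FRACTION THREE QUARTERS
}


def replace_fractions(text):
    """Return text with each fraction character replaced by its ASCII form."""
    for char, ascii_form in FRACTIONS.items():
        text = text.replace(char, ascii_form)
    return text


def read_and_replace(path):
    """Decode the file at path as UTF-8, then replace fraction characters."""
    with io.open(path, encoding="utf8") as myfile:
        data = myfile.read()  # unicode text, not bytes
    return replace_fractions(data)


# Build a throwaway UTF-8 file standing in for file.txt.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(sample_path, "w", encoding="utf8") as f:
    f.write(u"hello world \u00bd")

result = read_and_replace(sample_path)
os.remove(sample_path)
```

Because data is decoded when it is read, both sides of replace() are unicode and no implicit ASCII conversion takes place.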
wim
-1

It looks like you want to read it as unicode, but Python reads it as a byte string. Try this; the linked question looks similar to your UnicodeDecodeError:

https://stackoverflow.com/a/18649608/5504999

Try adding #coding: utf-8 on top of your file. This will allow the usage of Non-ASCII characters.
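
A short sketch of what the declaration does, assuming the script itself is saved as UTF-8. Per PEP 263 the comment has to sit on the first or second line of the file. The variable names here are only illustrative.

```python
# -*- coding: utf-8 -*-
# The declaration above only tells Python how this source file is encoded.
# It does not change how data read from other files is decoded.

half_literal = u"½"       # non-ASCII literal: needs the declaration on Python 2
half_escaped = u"\u00bd"  # same character spelled in pure ASCII

sample = u"hello world ½"
cleaned = sample.replace(half_literal, u"1/2")
```

If editing the header is not an option, the escaped spelling avoids the SyntaxError because the source then contains no non-ASCII bytes at all.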

Community
Imtiaz Raqib
  • I get: UnicodeEncodeError: 'ascii' codec can't encode character u'\uf057' in position 383: ordinal not in range(128) – Erik Mar 22 '16 at 16:36
  • Did you try reading your first parameter in replace() with `u.decode('utf-8')`? – Imtiaz Raqib Mar 22 '16 at 16:39
  • Look at @wim's answer. – Imtiaz Raqib Mar 22 '16 at 16:50
  • I tried: with codecs.open(HTML_PATH + file_name, "r", "utf8") as myfile: data = myfile.read() data = data.replace(u"½", u"1/2") and i get: SyntaxError: Non-ASCII character '\xc2' in file – Erik Mar 22 '16 at 16:52
  • Try my answer, i.e., add `#coding: utf-8` on top of your file. It allows program to read non-ascii characters. – Imtiaz Raqib Mar 22 '16 at 16:58
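
The UnicodeEncodeError mentioned in the comments is not explained in the thread. One plausible cause, stated here as an assumption since the code that raised it is not shown, is that the decoded text was later written or printed through a stream that falls back to the ASCII codec. A hedged sketch of writing the result with an explicit encoding instead (the temporary output file and the sample string are hypothetical):

```python
import io
import os
import tempfile

# \uf057 is the code point named in the traceback; the rest is sample text.
text = u"hello world \uf057 1/2"

fd, out_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# Opening with an explicit encoding lets the file object encode unicode text
# itself, so nothing falls back to the ASCII codec.
with io.open(out_path, "w", encoding="utf8") as out:
    out.write(text)

# Read the raw bytes back to confirm what was written.
with io.open(out_path, "rb") as check:
    written = check.read()
os.remove(out_path)
```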
-1

I think your initial string is not properly encoded as unicode.

What you are attempting works just fine:

>>> st=u"¼½¾"
>>> print st.replace(u"½", u"1/2")
¼1/2¾

But the target needs to be unicode to start with.
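
To illustrate "unicode to start with", here is a small sketch using made-up sample bytes: the raw data is decoded once, and only then passed to replace(). On Python 2, calling replace() with unicode arguments on the undecoded byte string triggers an implicit ASCII decode, which is where the UnicodeDecodeError comes from.

```python
# UTF-8 bytes as they would come from a file opened without an encoding.
raw = b"hello world \xc2\xbd"

# Decode once, at the boundary, so everything after this point is text.
text = raw.decode("utf-8")

# Both the target and the arguments are now unicode.
cleaned = text.replace(u"\u00bd", u"1/2")
```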

dawg
  • That's exactly what my code does: `data.replace(u"½", u"1/2")`, but it does not work – Erik Mar 22 '16 at 16:47
  • `data` is not a unicode string. That is why it is not working for you. Look at wim's answer. – dawg Mar 22 '16 at 16:47
  • I tried: with codecs.open(HTML_PATH + file_name, "r", "utf8") as myfile: data = myfile.read() data = data.replace(u"½", u"1/2") and i get: SyntaxError: Non-ASCII character '\xc2' in file – Erik Mar 22 '16 at 16:50