UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3 2: ordinal not in range(128)

Question

I am parsing an XLS file using xlrd. Most things are working fine. I have a dictionary where the keys are strings and the values are lists of strings. All the keys and values are Unicode. I can print most of the keys and values using str(), but some values contain the Unicode character \u2013, for which I get the above error.

I suspect that this is happening because this is Unicode embedded in Unicode and the Python interpreter cannot decode it. So how can I get rid of this error?

Lennart Regebro · Answer 1 · 2011-03-22T07:30:49.290

You can print Unicode objects directly; you don't need to wrap them in str().

Assuming you really want a str:

When you do str(u'\u2013') you are trying to convert the Unicode string to an 8-bit string. To do this you need an encoding, a mapping between Unicode data and 8-bit data. str() uses the system default encoding, which under Python 2 is ASCII. ASCII contains only the first 128 code points of Unicode, that is \u0000 to \u007F. The result is the above error: the ASCII codec simply doesn't know what \u2013 is (it's an en dash, by the way).
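A minimal sketch of that failure, assuming Python 3 syntax (where every str is already Unicode, so the explicit .encode('ascii') call stands in for what Python 2's str() does implicitly). The variable names are illustrative only.

```python
# Python 3 sketch: the explicit .encode('ascii') plays the role of
# Python 2's implicit str() conversion.
text = 'abc\u2013def'  # contains U+2013, which ASCII cannot represent

try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    failed_char = exc.object[exc.start:exc.end]  # the offending character
    reason = exc.reason                          # the codec's explanation
    position = exc.start                         # index of that character
    print(repr(failed_char), position, reason)
```

The exception object carries the same details that appear in the error message: the codec name, the character, its position and the reason.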

You therefore need to specify which encoding you want to use. Common ones are ISO-8859-1, better known as Latin-1, which contains the first 256 code points; UTF-8, which can encode all code points using a variable-length encoding; CP1252, which is common on Windows; and various Chinese and Japanese encodings.

You use them like this:

u'\u2013'.encode('utf8')

The result is a str containing the sequence of bytes that is the UTF-8 representation of the character in question:

'\xe2\x80\x93'

And you can print it:

>>> print '\xe2\x80\x93'
–
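To compare the encodings mentioned above, here is a small sketch in Python 3 syntax (an assumption; the answer itself targets Python 2, where the literals would be written u'...'). It encodes the same character with UTF-8, CP1252 and Latin-1 and then decodes the UTF-8 bytes again.

```python
dash = '\u2013'  # EN DASH

utf8_bytes = dash.encode('utf-8')     # three bytes
cp1252_bytes = dash.encode('cp1252')  # one byte; CP1252 has a slot for it

try:
    dash.encode('latin-1')            # Latin-1 stops at U+00FF
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Decoding with the same codec gets the original character back.
round_trip = utf8_bytes.decode('utf-8')

print(utf8_bytes, cp1252_bytes, latin1_ok, round_trip == dash)
```

Latin-1 fails for the same reason ASCII does: U+2013 lies outside the range it covers.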

This was very comprehensive, thanks. I have a question: say you are reading a Twitter stream and don't know the encoding up front. How would you handle that? – karthikr, Jun 28 '13 at 16:34
@karthikr: I find it hard to believe that Twitter doesn't provide the encoding. – Lennart Regebro, Jun 28 '13 at 20:59

score 29 · Answer 2 · answered Jan 14 '15 at 17:18

You can also try this to get the text.

foo.encode('ascii', 'ignore')
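The second argument to encode() selects an error handler, and 'ignore' is only one of several. A short sketch in Python 3 syntax (the sample string is made up for illustration) shows what each standard handler does with the en dash:

```python
text = 'pages 10\u201320'  # '10–20' written with an en dash

ignored = text.encode('ascii', 'ignore')             # dash dropped
replaced = text.encode('ascii', 'replace')           # dash becomes '?'
escaped = text.encode('ascii', 'backslashreplace')   # dash kept as an escape
xml_ref = text.encode('ascii', 'xmlcharrefreplace')  # dash kept as a char ref

for result in (ignored, replaced, escaped, xml_ref):
    print(result)
```

With 'ignore' the range "10–20" silently turns into "1020", which is the data loss the comments below warn about.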

Bilbo Baggins

After many SO searches, this fixed it for me. My particular usage was in a print due to both Windows and Linux throwing this encoding error. – ddisqq Jun 08 '16 at 17:04
This will lose data for any non-ASCII character; the correct method is to encode using the correct encoding. – Padraic Cunningham Sep 29 '16 at 13:12
This will ignore the non-ASCII character. Your answer is just to ignore the problem? – John Strood Aug 14 '18 at 11:30
If you aren't using non-ASCII characters, then yes. – Bilbo Baggins Aug 14 '18 at 23:02

score 7 · Answer 3 · edited May 23 '17 at 12:10

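Here str(u'\u2013') is what raises the error, so use isinstance(foo, basestring) to check whether foo is already a string (str or unicode). If it is not, convert it to Unicode first and then encode it:

if isinstance(foo, basestring):
    # Already a string: encode it directly (Python 2).
    encoded = foo.encode('utf8')
else:
    # Any other object: convert to unicode first, then encode.
    encoded = unicode(foo).encode('utf8')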
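For readers on Python 3, where basestring and unicode no longer exist, the same idea might look like the sketch below. The helper name to_utf8_bytes is hypothetical, and the code assumes that any bytes passed in are already in the intended encoding.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value                      # already encoded; leave untouched
    if isinstance(value, str):
        return value.encode('utf-8')      # text: encode directly
    return str(value).encode('utf-8')     # other objects: convert to text first


print(to_utf8_bytes('\u2013'))
print(to_utf8_bytes(42))
```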

Vaseem Ahmed Khan · answered Jan 05 '15 at 08:40 · edited by Community

score 5 · Answer 4 · edited Apr 28 '17 at 18:09

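I had the same problem. This worked fine for me (it helps on Python 3, where str is already Unicode; on Python 2, calling str() on a unicode object containing \u2013 raises the same error):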

str(objdata).encode('utf-8')

Mohsen · answered Apr 28 '17 at 18:08 · edited by FelixSFD

Chris · Answer 5 · 2019-11-09T15:42:48.430

I had exactly this issue in a recent project, and it was a real pain. I finally found that the cause was the Python in our Docker image using the encoding "ansi_x3.4-1968" instead of "utf-8". So if you are using Docker and hit this error, the following steps may solve your problem.

Create a file named default_locale in the same directory as your Dockerfile and put this line in it:

environment=LANG="es_ES.utf8", LC_ALL="es_ES.UTF-8", LC_LANG="es_ES.UTF-8"

Then add these lines to your Dockerfile:

RUN apt-get clean && apt-get update && apt-get install -y locales
RUN locale-gen en_CA.UTF-8
COPY ./default_locale /etc/default/locale
RUN chmod 0755 /etc/default/locale
ENV LC_ALL=en_CA.UTF-8
ENV LANG=en_CA.UTF-8
ENV LANGUAGE=en_CA.UTF-8

This thoroughly solved my issue once I rebuilt and reran the Docker image; hopefully it solves yours as well.
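To check whether a container is affected before and after such a change, a small diagnostic along these lines may help. It is a sketch in Python 3 syntax using only the standard library; the values printed depend entirely on the environment it runs in.

```python
import codecs
import locale
import sys

# "ansi_x3.4-1968" is simply another registered name for the ASCII codec.
odd_name_codec = codecs.lookup('ansi_x3.4-1968').name

# What the running interpreter will use for text I/O.
stdout_encoding = sys.stdout.encoding
preferred_encoding = locale.getpreferredencoding(False)

print('ansi_x3.4-1968 resolves to:', odd_name_codec)
print('stdout encoding:', stdout_encoding)
print('preferred encoding:', preferred_encoding)
```

If the last two lines report an ASCII variant rather than UTF-8, the locale settings are a likely place to look.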

score 0 · Answer 6 · answered Nov 21 '17 at 10:13

This works for me:

unicode(data).encode('utf-8')

Ulv3r

score 0 · Answer 7 · edited Aug 04 '21 at 09:25

First, look up which character the code point stands for, for example at https://unicode-table.com/en/2013/

Then in the code use this:

{your-string-variable}.replace(u"\u2013", "-")

Do likewise for every other Unicode character that triggers the error.
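When several characters need replacing, one lookup table can stand in for the chained replace() calls. The sketch below uses str.translate in Python 3 syntax; the table ASCII_FALLBACKS and the function name are hypothetical, and the choice of characters is only an example.

```python
# Hypothetical mapping; extend it with whichever characters your data contains.
ASCII_FALLBACKS = {
    0x2013: '-',   # EN DASH
    0x2014: '-',   # EM DASH
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
}


def asciify_punctuation(text):
    """Replace common typographic punctuation with ASCII look-alikes."""
    return text.translate(ASCII_FALLBACKS)


sample = '\u201cpages 10\u201320\u201d'
result = asciify_punctuation(sample)
print(result)
```

Characters that are not in the table pass through unchanged, so anything still outside ASCII will continue to raise the error when encoded.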

Nitin Rane · answered Aug 04 '21 at 07:43 · edited by MD Mushfirat Mohaimin

score 0 · Answer 8 · answered Mar 16 '23 at 17:34

While reading an Excel file using openpyxl, I encountered the same error, so I wrote a function that removes any non-ASCII characters while keeping newline characters.

def clean_string(b_string):
    # Drop every non-ASCII byte (value >= 128) and return the rest as a str.
    # Newlines are ASCII (byte 10), so they are kept without a special case.
    return b_string.decode('ascii', 'ignore')

It is used like this:

import openpyxl


def upload_conditions(request):
    # Condition.objects.all().delete()
    if request.method == 'POST':
        excel_file = request.FILES.get("nfile")
        wb = openpyxl.load_workbook(excel_file)
        excel_data = list()

        # "A" is the name of the worksheet to read.
        for letter in ["A"]:
            worksheet = wb[letter]
            for row in worksheet.iter_rows():
                row_data = list()
                for cell in row:
                    value = cell.value
                    if value:
                        # Encode to UTF-8 bytes, then drop the non-ASCII bytes.
                        encoded_string = str(value).encode('utf-8', 'ignore')
                        row_data.append(clean_string(encoded_string))
                    else:
                        row_data.append("")

                excel_data.append(row_data)

Calling clean_string on the encoded bytes quickly strips the non-ASCII characters from the string.

I encoded the string like this:

# value is the cell's text, for example 'Legionnaires’ disease\nwith newline'.
# Encoding it as UTF-8 gives b'Legionnaires\xe2\x80\x99 disease\nwith newline'.
encoded_string = str(value).encode('utf-8', 'ignore')
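Putting the two steps together on the sample value gives a quick end-to-end check. This is a sketch in Python 3 syntax; clean_string is restated here only so the snippet runs on its own, and it keeps the same rule as the function above (only bytes below 128 survive).

```python
def clean_string(b_string):
    # Same behaviour as the function above: keep only bytes below 128.
    return ''.join(chr(byte) for byte in b_string if byte < 128)


value = 'Legionnaires\u2019 disease\nwith newline'
encoded_string = str(value).encode('utf-8', 'ignore')
cleaned = clean_string(encoded_string)

print(encoded_string)
print(repr(cleaned))
```

The curly apostrophe is removed outright instead of being turned into a plain one, while the newline survives.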
