1488

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.

The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.

One of the sections of code that is causing problems is shown below:

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on SOME strings when the snippet above is run:

Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption - so there are no issues relating to internationalization or dealing with text written in anything other than English.

Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?

shaneb
Homunculus Reticulli
  • 1
    If you're getting these errors as a user rather than as a developer, check https://serverfault.com/questions/54591/how-to-install-change-locale-on-debian and https://askubuntu.com/questions/599808/cannot-set-lc-ctype-to-default-locale-no-such-file-or-directory – That Brazilian Guy Jun 11 '18 at 00:52
  • I'll add this point: don't use https://www.onlinegdb.com/online_python_interpreter for this stuff. I was using that interpreter to trial stuff out, and it's not configured correctly for Unicode! It was always printing in the format b'\nnn'... when all I wanted was a guillemet! I tried on a VM and it worked immediately as expected using chr() – JGFMK Aug 07 '18 at 22:06
  • 4
    Try this `import os; import locale; os.environ["PYTHONIOENCODING"] = "utf-8"; myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8"); ... print(myText.encode('utf-8', errors='ignore'))`. – hhh Apr 15 '19 at 21:48
  • @hhh I ran your snippet and got NameError: name 'myText' is not defined – KHAN irfan Apr 20 '19 at 10:31
  • 27
    Try to set [PYTHONIOENCODING](https://docs.python.org/3/using/cmdline.html#envvar-PYTHONIOENCODING) in the shell, before executing your script: `$ export PYTHONIOENCODING=utf8` – Noam Manos Aug 11 '19 at 08:50
  • In my case the string was u'1d6f4975842f050bf6503b19250d09f997b34f4a\n'; I just used `.encode('utf-8').strip()` on it. What that does is remove the trailing `\n`, which was still causing the problem even after calling encode('utf-8') alone. – Indrajeet Gour May 04 '22 at 14:58

34 Answers

1515

Read the Python Unicode HOWTO. This error is the very first example.

Do not use str() to convert from unicode to encoded text / bytes.

Instead, use .encode() to encode the string:

p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

or work entirely in unicode.
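A minimal sketch of that fix, with made-up stand-in values for the scraped fields (the name and number here are illustrative; only the join/encode pattern comes from the answer):

```python
# Illustrative values standing in for the scraped data (not the OP's real data)
agent_contact = u'John Smith'
agent_telno = u'0123\xa0456789'  # contains U+00A0, the non-breaking space from the traceback

# Join while everything is still unicode; encode to bytes only at the output boundary
agent_info = u' '.join((agent_contact, agent_telno)).strip()
encoded = agent_info.encode('utf-8')  # a plain byte string, safe to write out
```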

Mateen Ulhaq
agf
  • 32
    agreed! a good rule of thumb I was taught is to use the "unicode sandwich" idea. Your script accepts bytes from the outside world, but all processing should be done in unicode. Only when you are ready to output your data should it be mushed back into bytes! – Andbdrew Mar 30 '12 at 12:29
  • 274
    In case someone else gets confused by this, I found a strange thing: my terminal uses utf-8, and when I `print` my utf-8 strings it works nicely. However when I pipe my programs output to a file, it throws a `UnicodeEncodeError`. In fact, when output is redirected (to a file or a pipe), I find that `sys.stdout.encoding` is `None`! Tacking on `.encode('utf-8')` solves the problem. – drevicko Dec 18 '12 at 08:15
  • 108
    @drevicko: use `PYTHONIOENCODING=utf-8` instead, i.e., print Unicode strings and let the environment set the expected encoding. – jfs Dec 21 '13 at 03:51
  • @J.F.Sebastian: Do you think that's a valid approach in every case? Let's say you have a tool that's exporting a report which needs to have a particular encoding, it seems less straightforward to me that the user would need to change the environment settings just for that one export. Then I would rather have the program take the encoding as a parameter with a sensible default. – steinar Nov 25 '15 at 09:59
  • 2
    @steinar: nothing is valid in every case. In general, a user shouldn't care that you use Python to implement your utility (the interface shouldn't change if you decide to reimplement it in another language for whatever reason) and therefore you should not expect that the user is even aware of python-specific envvars. It is bad UI to force the user to specify a character encoding; embed the character encoding in the report format if necessary. Note: no hardcoded encoding can be a "sensible default" in the general case. – jfs Nov 25 '15 at 10:24
  • @J.F.Sebastian I agree that this varies by use case. My concern was that readers might follow the pattern of setting the environment variable rather than having it more clearly configurable, leading to users having to worry about in what language the utility was/will be written. I think having users specify the character encoding can make a lot of sense in some cases and seems then to be a better alternative than setting environment variables. But then I'm thinking of tools for advanced users. – steinar Nov 25 '15 at 10:45
  • `PYTHONIOENCODING` seems to be required if the locale is incorrectly built or the user is using a non-UTF-8 locale. Check `locale` before setting `PYTHONIOENCODING` – Alastair McCormack Nov 30 '15 at 21:26
  • 24
    This is bad and confusing advice. The reason people use str is because the object IS NOT already a string, so there's no `.encode()` method to call. – Cerin Oct 05 '16 at 17:59
  • 1
    @Cerin I'm not sure what you mean. If you're getting this error, you definitely have an object of type `str` -- a string. – agf Oct 07 '16 at 02:05
  • 1
    @agf, Nope, this error was being thrown by this code, `unicode(list(DjangoQueryObject))` – Cerin Oct 07 '16 at 15:49
  • What is the reason for the .strip()? Is that recommended for all conversion calls to .encode('utf-8')? – Praxiteles Apr 27 '17 at 07:07
  • 1
    @Praxiteles No, it's not. It was just copied from the original code, like the join and the variable names. It's something specific to his use case. – agf Apr 28 '17 at 20:45
  • @drevicko right on that is totally what is happening – Ben Liyanage Oct 19 '17 at 20:56
  • I perused the docs you referenced, but still don't understand why both `u''` and `.encode('utf-8')` are necessary. Doesn't the "u" indicate a unicode string? – Matt Mar 15 '18 at 22:10
  • 1
    @Matt I'm not sure I understand your question. Yes, "u" indicates a unicode string. The `encode` takes the decoded unicode string and encodes it into a specific utf-8 byte sequence. So after encoding it as utf-8, it's just a sequence of bytes, not unicode any more, as far as Python is concerned. – agf Mar 15 '18 at 22:19
  • @agf It was that one is a unicode string and the other is a byte sequence that I wasn't appreciating. I still don't understand why piping in the shell breaks without the `encode()` call (per @drevicko's comment.) – Matt Mar 19 '18 at 15:38
  • I use Mojibake (https://github.com/BeetleChunks/Mojibake) in my projects for managing data encodings. It has some useful functions for encoding/decoding strings and unicode as well as dicts, tuples, and lists. Its a good reference for how to encode/decode various data types too. – Beetle Jun 01 '19 at 05:10
  • i get the following error: Can't convert 'bytes' object to str implicitly – nomaam Jan 12 '20 at 21:47
  • 2
    @nomaam Then either you need to explicitly call `decode` to turn the bytes into a string, or you've got `decode` where you need `encode` or `encode` where you need `decode`. – agf Jan 13 '20 at 22:02
484

This is a classic python unicode pain point! Consider the following:

a = u'bats\u00E0'
print a
 => batsà

All good so far, but if we call str(a), let's see what happens:

str(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

Oh dip, that's not gonna do anyone any good! To fix the error, encode the unicode string explicitly with .encode and tell python what codec to use:

a.encode('utf-8')
 => 'bats\xc3\xa0'
print a.encode('utf-8')
 => batsà

Voilà!

The issue is that when you call str(), python uses the default character encoding (ASCII) to try and encode the string you gave it, which in your case sometimes contains non-ASCII unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.

For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
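The round trip behind the fix can be sketched as follows (Python 3 syntax shown; on Python 2 the same calls apply to unicode objects):

```python
a = u'bats\u00e0'
b = a.encode('utf-8')          # explicit codec: unicode text -> UTF-8 bytes
assert b == b'bats\xc3\xa0'    # the accented character became two bytes
assert b.decode('utf-8') == a  # decoding with the same codec restores the text
```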

Andbdrew
  • 96
    Personal note: When trying to type ".encode" don't accidentally type ".unicode" then wonder why nothing is working. – Skip Huffman Dec 24 '12 at 14:38
  • 13
    Good advice. But what do you do instead when you were using str(x) to print objects that may or may not be strings? str(x) works if x is a number, datetime, boolean, or normal string. Suddenly, if it's unicode, it stops working. Is there a way to get the same behaviour, or do we now need to add an IF check to test whether the object is a string, using .encode for unicode and str() otherwise? – Dirk R Jan 25 '18 at 16:50
  • Same question could be asked with `None` value. – Vadorequest Nov 19 '18 at 13:54
249

I found an elegant workaround that removes the offending symbols while keeping the value a string:

yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

It's important to note that using the ignore option is dangerous because it silently drops any unicode (and internationalization) support from the code that uses it, as seen here (convert unicode):

>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'
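If dropping characters outright is too drastic, the `replace` error handler keeps a placeholder instead (a small sketch):

```python
s = u'City: Malm\xf6'
assert s.encode('ascii', 'ignore').decode('ascii') == 'City: Malm'    # the ö is silently dropped
assert s.encode('ascii', 'replace').decode('ascii') == 'City: Malm?'  # the ö becomes a ? placeholder
```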
Edeson Bizerril
Max Korolevsky
  • 19
    You made my day! For utf-8, it's sufficient to do: `yourstring = yourstring.encode('utf-8', 'ignore').decode('utf-8')` – luca76 Feb 14 '17 at 15:38
  • This did work for me, but my case was different: I was saving file names that contained "/", so the path didn't exist, and I had to use .replace("/", ""), which saved my script. Ignoring errors also works for the 'utf-8' case. – Akash Kandpal Jul 05 '18 at 08:38
  • 1
    @harrypotter0 for concatenating file paths correctly use `os.path.join()`, it's a very good habit when you start doing cross-platform programming. :) – login_not_failed Aug 22 '18 at 07:36
190

Well, I tried everything else without success; after googling around I figured out the following, and it helped. Python 2.7 is in use.

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Ashwin
  • 8
    Don't do this. http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script, although when you have answers like this http://stackoverflow.com/a/31137935/2141635 near the top of the results when you search for the error I can see why it may seem like a good idea. – Padraic Cunningham Sep 08 '16 at 11:50
  • 33
    I tried almost all of the suggestions in this topic and really none worked for me. Finally I tried this one, and it's really THE ONLY one that worked, simply and well. If someone says "Don't do this", then come with a simple solution; otherwise use this one, because it's a good working copy-and-paste solution. – Richard de Ree Aug 24 '17 at 11:48
  • This solution works with file.write(), as well as print(). – Jacob Quisenberry Dec 01 '17 at 17:47
  • 4
    How could this be done in python3 ? Would be happy to know. – Kanerva Peter Mar 07 '18 at 12:24
  • 1
    Don't do this! If you do this, you can avoid *heaps* of arcane knowledge of Python2 and unicode! The horror! – Prof. Falken Jul 15 '18 at 13:38
  • 6
    I'd just add an `if sys.version_info.major < 3:` – Prof. Falken Jul 15 '18 at 13:52
  • I agree. I use `import sys; import io; sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='utf-8'); sys.stderr = io.TextIOWrapper(sys.stderr.detach(), encoding='utf-8')` – workin 4weekend Jul 28 '19 at 17:21
  • Confirmed for my Python 2.6, 2.7 versions. Tried other solutions but those won't work. Only this one. – Alper Jan 22 '21 at 11:06
  • 1
    The problem with this solution is that it changes the encoding for all of Python, which means that external modules, which could perhaps have been written with the assumption that ASCII is the default encoding, are now operating with a different encoding. I am not brave enough to risk chasing bugs in all the external modules that my particular application uses. So if it's working for you, it may be only because you're not using an external module that is fragile in this way. – Iron Pillow May 11 '21 at 15:52
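For Python 3, where `reload(sys)`/`setdefaultencoding` no longer works, the per-stream wrapping suggested in the comments can be sketched without touching any global default; here an in-memory buffer stands in for `sys.stdout`:

```python
import io

buf = io.BytesIO()                             # stand-in for a byte-oriented stdout
out = io.TextIOWrapper(buf, encoding='utf-8')  # wrap it with an explicit encoding
print(u'voil\u00e0', file=out)                 # unicode in, UTF-8 bytes out
out.flush()
assert buf.getvalue() == b'voil\xc3\xa0\n'
```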
106

A subtle problem that causes even print to fail is having your environment variables set wrong, e.g. here LC_ALL set to "C". In Debian they discourage setting it: Debian wiki on Locale

$ echo $LANG
en_US.utf8
$ echo $LC_ALL 
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà
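From inside Python you can check which encoding the locale settings give you (a small diagnostic sketch; the printed values depend entirely on your environment):

```python
import locale
import sys

# Both of these are derived from LANG/LC_ALL when stdout is a terminal
print(locale.getpreferredencoding())  # e.g. 'UTF-8' or 'ANSI_X3.4-1968'
print(sys.stdout.encoding)            # may be None when output is redirected
```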
maxpolk
  • Got exactly same issue, so bad I didn't checked it before [reporting](https://github.com/GoogleCloudPlatform/gsutil/issues/244). Thanks a lot. By the way, you can replace first two commands with `env|grep -E '(LC|LANG)'`. – Dmitry Verhoturov Aug 08 '15 at 07:28
  • Just my two cents on wrong encoding issue. I frequently use `mc` in "subshell mode" (`Ctrl-O`) and I also forgot that I added the following alias to bash: `alias mc="LANG=en_EN.UTF-8 mc"`. So when I tried to run poorly-written scripts which rely on `ru_RU.UTF-8` internally, they just die. Tried lots of stuff from this thread before I discovered the real issue. :) – login_not_failed Aug 22 '18 at 07:52
  • YOU ARE AWESOME. In GSUTIL, my rsync was failing because of exactly this problem. Fixed the LC_ALL and everything works fine as wine. <3 THANK YOU <3 – dsignr Dec 23 '19 at 05:24
41

The problem is that you're trying to print a unicode character, but your terminal doesn't support it.

You can try installing language-pack-en package to fix that:

sudo apt-get install language-pack-en

which provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print).

On some Linux distributions it's required in order to make sure that the default English locales are set up properly (so unicode characters can be handled by the shell/terminal). Sometimes it's easier to install it than to configure it manually.

Then when writing the code, make sure you use the right encoding in your code.

For example:

open(foo, encoding='utf-8')

If you've still a problem, double check your system configuration, such as:

  • Your locale file (/etc/default/locale), which should have e.g.

    LANG="en_US.UTF-8"
    LC_ALL="en_US.UTF-8"
    

    or:

    LC_ALL=C.UTF-8
    LANG=C.UTF-8
    
  • Value of LANG/LC_CTYPE in shell.

  • Check which locale your shell supports by:

    locale -a | grep "UTF-8"
    

Demonstrating the problem and solution in fresh VM.

  1. Initialize and provision the VM (e.g. using vagrant):

    vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
    

    See: available Ubuntu boxes..

  2. Printing unicode characters (such as the trade mark sign ™):

    $ python -c 'print(u"\u2122");'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
    
  3. Now installing language-pack-en:

    $ sudo apt-get -y install language-pack-en
    The following extra packages will be installed:
      language-pack-en-base
    Generating locales...
      en_GB.UTF-8... /usr/sbin/locale-gen: done
    Generation complete.
    
  4. Now problem should be solved:

    $ python -c 'print(u"\u2122");'
    ™
    
  5. Otherwise, try the following command:

    $ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
    ™
    
kenorb
  • 1
    What has `language-pack-en` got to do with Python or this question? AFAIK, it may provide language translations to messages but has nothing to do with encoding – Alastair McCormack Dec 26 '15 at 10:47
  • 3
    On some Linux distributions it's required in order to make sure that the default English locales are set-up properly, especially when running Python script on the Terminal. It worked for me at one point. See: [character encoding](http://unix.stackexchange.com/q/3218/21471) – kenorb Dec 26 '15 at 11:00
  • Ah, ok. You mean if you want to use a non-English locale? I guess the user will also have to edit `/etc/locale.gen` to ensure their locale is built before using it? – Alastair McCormack Dec 26 '15 at 11:04
  • @AlastairMcCormack Added reproducible steps of the problem in VM to make it clearer. I've tested and the solution still works. – kenorb Dec 26 '15 at 22:55
  • Great update @kenorb! Instead of installing `language-pack-en`, what would've happened if you just uncommented `en_GB.UTF-8` in `/etc/locale.gen` and ran `locale-gen`? Also, what was the result of `locale` before making changes. The reason I ask is that I'm intrigued what `language-pack-en` actually does :) – Alastair McCormack Dec 26 '15 at 23:08
  • 1
    @AlastairMcCormack Commented out `LANG` from `/etc/default/locale` (as `/etc/locale.gen` doesn't exist) and ran `locale-gen`, but it didn't help. I'm not sure what `language-pack-en` exactly does, as I didn't find much documentation and listing its contents doesn't help much. – kenorb Dec 27 '15 at 13:07
  • 1
    it is unlikely that there are no utf-8 locales on a desktop system already i.e., it is likely that you don't need to install anything, just configure `LANG`/ `LC_CTYPE`/ `LC_ALL` instead (e.g., `LANG=C.UTF-8`). – jfs Jan 01 '16 at 03:10
33

In shell:

  1. Find supported UTF-8 locale by the following command:

    locale -a | grep "UTF-8"
    
  2. Export it, before running the script, e.g.:

    export LC_ALL=$(locale -a | grep UTF-8)
    

    or manually like:

    export LC_ALL=C.UTF-8
    
  3. Test it by printing a special character, e.g. the trade mark sign (™):

    python -c 'print(u"\u2122");'
    

The above was tested on Ubuntu.

kenorb
30

I've actually found that in most of my cases, just stripping out those characters is much simpler:

s = mystring.decode('ascii', 'ignore')
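Concretely (Python 3 bytes literal shown; on Python 2 `mystring` would be a plain `str`), every non-ASCII byte is silently discarded rather than transcoded:

```python
mystring = b'bats\xc3\xa0'              # the UTF-8 bytes for u'bats\u00e0'
s = mystring.decode('ascii', 'ignore')  # non-ASCII bytes are dropped, not decoded
assert s == 'bats'                      # the accented character is gone entirely
```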
Phil LaNasa
  • 31
    "Perfectly" is not usually what it performs. It throws away stuff which you should figure out how to deal with properly. – tripleee Dec 13 '14 at 16:53
  • 8
    just stripping out "those" (non-english) characters is not the solution since python must support all languages dont you think? – alemol Jan 09 '15 at 19:47
  • 8
    Downvoted. This is not the correct solution at all. Learn how to work with Unicode: http://www.joelonsoftware.com/articles/Unicode.html – Andrew Ferrier Jan 13 '15 at 13:04
  • 1
    Agreed. Downvoted because if you actually took time to watch Ned Batchelder's unicode pain video you'd know why. – kenny Jan 18 '15 at 14:28
  • 4
    Look, the most judicious way to present this particular answer is in this way: recognizing that ascii confers a certain privilege on certain languages and users - this is the _escape hatch_ that may be exploited for those users who may be hacking a cursory, first pass, script together potentially for preliminary work before full unicode support is implemented. – lol Jun 19 '16 at 14:07
  • 7
    If I'm writing a script that just needs to print english text to stdout in an internal company application, I just want the problem to go away. Whatever works. – kagronick May 11 '17 at 14:11
  • This is plain wrong. I want my package delivered, and just because the authorities can't figure out how to ship it or how to categorize it (say, its encoding), they must not ignore my package. I want it delivered correctly. – angrysumit Aug 08 '19 at 13:35
29

For me, what worked was:

BeautifulSoup(html_text, from_encoding="utf-8")

Hope this helps someone.

Animesh
22

Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.

def safeStr(obj):
    try: return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    except: return ""

Testing it:

if __name__ == '__main__': 
    print safeStr( 1 ) 
    print safeStr( "test" ) 
    print u'98\xb0'
    print safeStr( u'98\xb0' )

Results:

1
test
98°
98

UPDATE: My original answer was written for Python 2. For Python 3:

def safeStr(obj):
    try: return str(obj).encode('ascii', 'ignore').decode('ascii')
    except: return ""

Note: if you'd prefer to leave a ? indicator where the "unsafe" unicode characters are, specify replace instead of ignore in the call to encode for the error handler.

Suggestion: you might want to name this function toAscii instead? That's a matter of preference...

Finally, here's a more robust PY2/3 version using six, where I opted to use replace, and peppered in some character swaps to replace fancy unicode quotes and apostrophes which curl left or right with the simple vertical ones that are part of the ascii set. You might expand on such swaps yourself:

from six import PY2, iteritems 

CHAR_SWAP = { u'\u201c': u'"'
            , u'\u201D': u'"' 
            , u'\u2018': u"'" 
            , u'\u2019': u"'" 
}

def toAscii( text ) :    
    try:
        for k,v in iteritems( CHAR_SWAP ): 
            text = text.replace(k,v)
    except: pass     
    try: return str( text ) if PY2 else bytes( text, 'ascii', 'replace' ).decode('ascii')
    except UnicodeEncodeError:
        return text.encode('ascii', 'replace').decode('ascii')
    except: return ""

if __name__ == '__main__':     
    print( toAscii( u'testin\u2019' ) )
BuvinJ
17

Add the line below at the beginning of your script (or as the second line):

# -*- coding: utf-8 -*-

That's the declaration of the Python source code encoding. More info in PEP 263.

Andriy Ivaneyko
  • 3
    This does not solve the problem when the processed text, loaded from an external file, contains utf-8-encoded data. It helps only for literals written in the given python script itself, and is just a clue for the python interpreter; it has no impact on text processing. – Mikaelblomkvistsson Jan 08 '19 at 14:43
12

I always put the code below in the first two lines of the python files:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
Pereira
  • 1
    Thank you so much! I didn't understand why it was working on other scripts and not on this one. The answer was the missing from __future__ import ;) – HamzDiou Jul 23 '21 at 10:00
12

It works for me:

export LC_CTYPE="en_US.UTF-8"
Simas Joneliunas
Caleb
10

Alas, this works in Python 3 at least...

Python 3

Sometimes the error is in the environment variables and encoding, so

import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
... 
print(myText.encode('utf-8', errors='ignore'))

where errors in encoding are ignored.

hhh
9

In case it's an issue with a print statement, a lot of times it's just an issue with the terminal printing. This helped me: export PYTHONIOENCODING=UTF-8

Dreams
8

Simple helper functions found here.

def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')
Parag Tyagi
  • To get the escaped bytestring (to convert arbitrary Unicode string to bytes using ascii encoding), you could use `backslashreplace` error handler: `u'\xa0'.encode('ascii', 'backslashreplace')`. Though you should avoid such representation and configure your environment to accept non-ascii characters instead -- it is 2016! – jfs Jan 01 '16 at 03:05
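The `backslashreplace` error handler mentioned in the comment behaves like this (sketch):

```python
# Escape, rather than drop, the non-ASCII character:
escaped = u'\xa0'.encode('ascii', 'backslashreplace')
assert escaped == b'\\xa0'  # a 4-byte ASCII string: backslash, x, a, 0
```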
8

Just call encode('utf-8') on the variable:

agent_contact.encode('utf-8')
Kairat Koibagarov
7

Open a terminal and run the command below:

export LC_ALL="en_US.UTF-8"
Ngoc-Vuong Ho
7

Late answer, but this error is related to your terminal's encoding not supporting certain characters.
I fixed it on python3 using:

import sys
import io

sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
print("é, à, ...")
Pedro Lobito
6

I just used the following:

import unicodedata
message = unicodedata.normalize("NFKD", message)

Check what documentation says about it:

unicodedata.normalize(form, unistr) Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various way. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.

Solves it for me. Simple and easy.
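For the U+00A0 in the question this works because NFKD decomposes the non-breaking space into a plain ASCII space (the phone string below is made up for illustration):

```python
import unicodedata

message = u'Call:\xa0020 7946 0123'  # hypothetical string containing a non-breaking space
normalized = unicodedata.normalize('NFKD', message)
assert normalized == u'Call: 020 7946 0123'  # U+00A0 became U+0020, so it is ASCII-safe now
```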

Drag0
5

The solution below worked for me. I just added

u"String"

(representing the string as unicode) before my string.

result_html = result.to_html(col_space=1, index=False, justify={'right'})

text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report.  Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)
Aravind Krishnakumar
5

In the general case of writing a string with an unsupported encoding (let's say data_that_causes_this_error) to some file (e.g. results.txt), this works:

f = open("results.txt", "w")
f.write(data_that_causes_this_error.encode('utf-8'))
f.close()
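Alternatively, you can let the file object do the encoding by opening the file with an explicit encoding (`io.open` on Python 2, the built-in `open` on Python 3); a self-contained sketch using a temporary file:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'results.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'voil\u00e0')        # pass unicode; the file object encodes it for you
with io.open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()      # decoded back to unicode on the way in
```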
Pe Dro
4

I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:

# 'value' contains the problematic data
unic = u''
unic += value
value = unic

I had this idea after reading Ned's presentation.

I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.

pepoluan
  • 3
    What is the `type` of value? before and after this? I think why that works is that by doing a `unic += value` which is the same as `unic = unic + value` you are adding a string and a unicode, where python then assumes unicode for the resultant `unic` i.e. the more precise type (think about when you do this `a = float(1) + int(1)`, `a` becomes a float) and then `value = unic` points `value` to the new `unic` object which happens to be unicode. – Tom Myddeltyn May 24 '16 at 21:16
4

We struck this error when running manage.py migrate in Django with localized fixtures.

Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.

The issue was simply that the Django container (we use docker) was missing the LANG env var.

Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.

followben
4

Update for Python 3.0 and later. Try the following in a terminal (these are shell commands, not Python):

locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8

This sets the system's default locale encoding to the UTF-8 format.

More can be read here at PEP 538 -- Coercing the legacy C locale to a UTF-8 based locale.

ZF007
4

The recommended solution did not work for me, and I could live with dumping all non-ASCII characters, so

s = s.encode('ascii',errors='ignore')

which left me with something stripped that doesn't throw errors.
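Note that on Python 3 the result of `encode` is a `bytes` object, not a `str` (a small sketch):

```python
s = u'Malm\xf6'
stripped = s.encode('ascii', errors='ignore')
assert stripped == b'Malm'          # the ö is gone
assert isinstance(stripped, bytes)  # call .decode('ascii') if you need a str again
```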

Gulzar
3

Many answers here (@agf and @Andbdrew for example) have already addressed the most immediate aspects of the OP question.

However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who, like me, ended up here while trying to make sense of encodings in Python: Python 2 vs Python 3 management of character representation is wildly different. I feel like a big chunk of the confusion out there has to do with people reading about encodings in Python without being version aware.

I suggest anyone interested in understanding the root cause of OP problem to begin by reading Spolsky's introduction to character representations and Unicode and then move to Batchelder on Unicode in Python 2 and Python 3.

Simón Ramírez Amaya
  • yes, my error was on Python 2.7, 'a'.format(u'ñ'), and the correct solution is not to use .encode('utf-8') but to always use unicode strings (the default in Python 3): u'a'.format(u'ñ') – Rogelio Triviño Nov 19 '19 at 13:16
3

Try to avoid converting a variable with str(variable). Sometimes it may cause the issue.

A simple tip to avoid it:

try: 
    data = str(data)
except:
    data = data  # don't convert to string

The above will avoid the encode error as well.

sam ruben
1

If you have something like packet_data = "This is data", then do this on the next lines, right after initializing packet_data:

unic = u''
unic += packet_data
packet_data = unic
halfer
1

You can set the character encoding to UTF-8 before running your script:

export LC_CTYPE="en_US.UTF-8"

This should generally resolve the issue.

Babatunde Adeyemi
0

I had this issue trying to output Unicode characters to stdout, but with sys.stdout.write, rather than print (so that I could support output to a different file as well).

From BeautifulSoup's own documentation, I solved this with the codecs library:

import sys
import codecs

def main(fIn, fOut):
    soup = BeautifulSoup(fIn)
    # Do processing, with data including non-ASCII characters
    fOut.write(unicode(soup))

if __name__ == '__main__':
    with (sys.stdin) as fIn: # Don't think we need codecs.getreader here
        with codecs.getwriter('utf-8')(sys.stdout) as fOut:
            main(fIn, fOut)
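The `codecs.getwriter` wrapper can be exercised against an in-memory byte buffer instead of `sys.stdout` (a sketch):

```python
import codecs
import io

raw = io.BytesIO()                       # stands in for sys.stdout's byte layer
writer = codecs.getwriter('utf-8')(raw)  # wraps it so unicode input gets encoded
writer.write(u'caf\xe9')
writer.flush()
assert raw.getvalue() == b'caf\xc3\xa9'  # é was encoded as the two UTF-8 bytes
```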
palswim
0

This problem often happens when a Django project is deployed using Apache, because Apache sets the environment variable LANG=C in /etc/sysconfig/httpd. Just open the file and comment out (or change to your preference) this setting. Or use the lang option of the WSGIDaemonProcess command, in which case you will be able to set a different LANG environment variable for different virtual hosts.

shmakovpn
0

This will work:

>>> print(unicodedata.normalize('NFD', re.sub("[\(\[].*?[\)\]]", "", "bats\xc3\xa0")).encode('ascii', 'ignore'))

Output:

bats
huzefausama
-1

You can use unicodedata to avoid the UnicodeEncodeError. Here is an example:

import unicodedata

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
agent_telno = unicodedata.normalize("NFKD", agent_telno) # replaces unwanted characters like '\xa0' with their ASCII equivalents
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
boyenec