Convert a Unicode string to a string in Python (containing extra symbols)

Question

How do you convert a Unicode string (containing extra characters like £, $, etc.) into a Python string?

What do you mean by "a python string"? Do you want to encode the unicode string? — JacquesB, Jul 30 '09 at 15:48
I'm getting Unicode sent from a form in an HTML window, with symbols I want to be able to save to a file, but it's not working — William Troup, Jul 30 '09 at 15:57
I doubt that you get Unicode from a web request. You probably get UTF-8 encoded Unicode. — , Jul 30 '09 at 16:15
We need to know what Python version you are using, and what it is that you are calling a Unicode string. Do the following on a short unicode_string that includes the currency symbols that are causing the bother: Python 2.x : `print type(unicode_string), repr(unicode_string)` Python 3.x : `print(type(unicode_string), ascii(unicode_string))` Then edit your question and copy/paste the results of the above print statement. DON'T retype the results. Also look up near the top of your HTML and see if you can find something like this: — John Machin, Jul 30 '09 at 16:13
You should really clarify what you mean by *unicode string* and *python string* (giving concrete examples would be the best I guess) as it's clear from comments there are different interpretations of your question. I wonder why you haven't done this although it's over 3,5 years since you asked this question. — Piotr Dobrogost, Jan 21 '13 at 12:45
@jalf: If it is *encoded*; it is no longer Unicode e.g., `unicode_string = u"I'm unicode string"; bytestring = unicode_string.encode('utf-8'); unicode_again = bytestring.decode('utf-8')` — jfs, Dec 21 '13 at 01:47
@J.F.Sebastian: You mean "it is not of the Python Unicode string datatype" (which goes without saying, because what you receive over a network socket from an HTTP request is a stream of bytes, and not a Python value), but UTF-8 text most certainly is Unicode. That is kind of the entire point in the UTF-8 encoding. — jalf, Dec 21 '13 at 10:38
@jalf: utf-8 is a character encoding. You can use it to interpret a sequence of bytes as text (sequence of Unicode codepoints -- that you may call Unicode text (it has *nothing* to do with Python)). Sequence of bytes itself is not a Unicode string. — jfs, Dec 21 '13 at 10:57
@J.F.Sebastian But we are not talking about "a sequence of bytes itself". We are talking about a string encoded as UTF-8. There is **no** possible way in which a string encoded as UTF-8 is not a Unicode string, because UTF-8 is a Unicode encoding. It does not encode cars, sunsets, emotions or waffles. It encodes Unicode text. A text encoded as UTF-8 is a Unicode text. I am simply reacting to your incorrect statement that "a string which is encoded is no longer Unicode". — jalf, Dec 21 '13 at 14:01
@wnys (plus encoding rot-13): Let's check whether an encoded string is the same as original. fyi, `wnys` is `jalf` encoded using rot-13 encoding. — jfs, Dec 21 '13 at 16:26
Hopefully future passers-by come to understand that when you say something is "encoded" you are saying "it's not what it actually is, it's a representation of another thing in a form that we can handle with specific restrictions." E.g. using UTF-8 so that C string handling utilities "work," despite C not knowing anything of Unicode or UTF. — dash-tom-bang, Sep 17 '15 at 23:20
Retagged this as a 2.x question because **it is incoherent in 3.x**: "a unicode string" **is** "a Python string" in every possible meaningful sense in 3.x. (In 2.x, `str` means a bytes type that is not a real string type, but really Unicode strings are still "Python strings"...). — Karl Knechtel, May 24 '23 at 05:45

score 625 · Accepted Answer · edited Jun 09 '20 at 12:05

See unicodedata.normalize

title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
'Kluft skrams infor pa federal electoral groe'
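For readers on Python 3, the same idea can be sketched as follows. This assumes Python 3, where `encode` returns `bytes`, and the helper name `strip_to_ascii` is made up for illustration. The conversion is lossy: accented letters lose their accents, and characters with no ASCII decomposition (such as `ß` or `£`) are silently dropped.

```python
import unicodedata


def strip_to_ascii(text):
    """Lossy: decompose accented letters, then drop everything non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    # 'ignore' discards the combining marks and any other non-ASCII character
    return decomposed.encode('ascii', 'ignore').decode('ascii')


title = "Klüft skräms inför på fédéral électoral große"
ascii_title = strip_to_ascii(title)
print(ascii_title)  # Kluft skrams infor pa federal electoral groe
```

If the goal is only to store or transmit the text, encoding to UTF-8 keeps every character and avoids this loss.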

edited Jun 09 '20 at 12:05

phoenix

answered Jul 30 '09 at 15:44

Sorantis

28

+1 answers the question as worded, @williamtroup's problem of not being able to save unicode to a file sounds like an entirely different issue worthy of a separate question – Mark Roddy Jul 30 '09 at 16:03
5

@John - that answer predates the OP's clarification. – Dominic Rodger Jul 30 '09 at 16:16
10

@Mark Roddy: His question as written is how to convert a "Unicode string" (whatever he means by that) containing some currency symbols to a "Python string" (whatever ...) and you think that a remove-some-diacritics delete-other-non-ascii characters kludge answers his question??? – John Machin Jul 30 '09 at 16:25
2

@Dominic: I'm very sorry; I'll rephrase that: The OP's unclarified question said he wanted to CONVERT it TO A PYTHON STRING, not mangle it. – John Machin Jul 30 '09 at 17:19
Note that normalize() does not handle Unicode punctuation (e.g., smart quotes, apostrophes, dashes), probably because punctuation characters are not composite characters. There's a good discussion and alternate solution here: http://stackoverflow.com/a/816319/234823 – David Jul 17 '12 at 16:33
2

This answer, as it is, without any reservations is plainly **WRONG** as hinted by @JohnMachin. Please consider **VOTING IT DOWN**. – Piotr Dobrogost Jan 21 '13 at 12:51
14

@JohnMachin This answers the question word for word: The **only** way to convert a `unicode` string to a `str` is to either drop or convert the characters that cannot be represented in ASCII. So +1 from me. – Izkata Oct 14 '13 at 21:45
1

@PiotrDobrogost See my prior comment as well. – Izkata Oct 14 '13 at 21:45
5

@lzkata: no, it is not. `type(title) == unicode and type(title.encode('utf-8')) == str`. No need to corrupt the input, to get a bytestring that can be saved to a file. – jfs Dec 21 '13 at 01:53
2

Why does this have so many upvotes? And why is it the accepted answer? It's a good way to strip diacritics from Latin text, which has its uses (implementing a semi-naïve search feature, for example), but it is NOT what the OP was asking. – rmunn Mar 27 '14 at 04:04
3

this is an utter embarrassment. please do not arbitrarily destroy parts of characters in foreign languages. (this will completely remove any CJK text, for example.) fix whatever broken system is choking on them in the first place. – Eevee Aug 27 '15 at 07:58
@J.F.Sebastian You can save a not-encoded string to a file as well, presumably, but the problem then becomes one of retrieving it from that file. Without a standardized mechanism to interpret that file (e.g. XML with a designated character encoding) then all bets are off. These days you can assume UTF-8 but isn't assuming things what gave us 8-bit chars in the first place? – dash-tom-bang Sep 17 '15 at 23:24
1

What is `NFKD` (passed as the first argument to `unicodedata.normalize`)? – joshreesjones May 25 '16 at 22:16
@joshreesjones https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize – Thoughtful Dragon May 30 '16 at 22:17
1

Well, I "missearched" and ended up here, with the exact code I want to do, so ... – Fábio Dias Jul 21 '16 at 23:11
4

Please do not use this code! Completely deleting characters like a German "ß" is in no way converting. This code "converts" `Fuß` to `Fu` or `groß` to `gro` where neither `Fu` nor `gro` have any meaning in German. The same holds true for other languages where `Rødgrød` becomes `Rdgrd`. – z80crew May 11 '18 at 09:58
This answer indeed doesn't answer the question. – David Callanan Aug 30 '18 at 12:50
@jfs , when I `title.encode('utf-8')` this: `u"Klüft skräms inför på fédéral électoral große"`, becomes this: `"Kl├╝ft skr├ñms inf├╢r p├Ñ f├⌐d├⌐ral ├⌐lectoral gro├ƒe"` ; is that what you meant by "bytestring"? It seems "mangled" to me; did I do something wrong? Is this the expected result? – Nate Anderson Apr 03 '19 at 22:04
@TheRedPea you did wrong. The result is mojibake. To write Unicode text to a file in Python https://stackoverflow.com/a/35086151/4279 – jfs Apr 04 '19 at 01:18
@jfs thank you. But I'm not interested in writing to file. OP doesn't mention writing to file. Your linked answer shows how to write a *unicode* object directly to file. But this question is about strings, not files; I'm referring to your earlier comment `type(title.encode('utf-8')) == str` can you print the result for me when you run this in Python 2.# ? `u"Klüft skräms inför på fédéral électoral große".encode('utf-8')` The `type(..)` will be `str` as you say, but what's the result of `encode`? You said "No need to corrupt the input"; how can I avoid mojibake/ corrupting input? – Nate Anderson Apr 04 '19 at 16:00
I guess doing `encode`, i.e. `u"Klüft".encode('utf-8')` as @jfs suggested will replace the `ü` ; the result is this string: `'Kl\xc3\xbcft'`. [Image here.](https://i.postimg.cc/V6n2NT6D/2019-04-04-10-10-13-Convert-a-Unicode-string-to-a-string-in-Pyth.png) *Printing* the latter string, will appear as mojibake. *Decoding* the latter string will go back to a `unicode` object: `u"Klüft"`. Same is shown in [answers below](https://stackoverflow.com/a/1207496/1175496). I think this is what @Izkata said: "convert the characters that cannot be represented in ASCII." i.e. `u'ü'` -> `'\xc3\xbc'` – Nate Anderson Apr 04 '19 at 16:16
@TheRedPea the mojibake in your comment indicates that you did write the text (unicode type on Python 2) encoded to bytes (str type on Python 2) using one character encoding to a file and then read the file using a different character encoding. If you want to *print* text, use unicode, don't encode to bytes prematurely (the point of the linked answer). If you want to use bytes to represent text (you shouldn't), use the same encoding for writing&reading (e.g., see `sys.stdout.encoding` if it is set — yes, `sys.stdout` is a file and you use it when you print). – jfs Apr 04 '19 at 17:14
"(e.g., see sys.stdout.encoding if it is set — yes, sys.stdout is a file and you use it when you print)." great, thanks – Nate Anderson Apr 04 '19 at 17:24

score 341 · Answer 2 · edited Jan 09 '14 at 04:19

You can use encode to ASCII if you don't need to translate the non-ASCII characters:

>>> a=u"aaaàçççñññ"
>>> type(a)
<type 'unicode'>
>>> a.encode('ascii','ignore')
'aaa'
>>> a.encode('ascii','replace')
'aaa???????'
>>>
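As a rough Python 3 counterpart to the session above, the sketch below runs the same string through several of the standard error handlers. It assumes Python 3, so the results are `bytes` objects (hence the `b'...'` prefix); the variable names are only illustrative.

```python
a = "aaaàçççñññ"

# Standard error handlers accepted by str.encode()
handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
results = {name: a.encode("ascii", name) for name in handlers}

for name, encoded in results.items():
    print(name, encoded)

# The default handler, "strict", raises instead of losing data.
try:
    a.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)
```

`ignore` and `replace` destroy information, while `xmlcharrefreplace` and `backslashreplace` keep it in an escaped form that can be recovered later.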

edited Jan 09 '14 at 04:19

Peter Mortensen

answered Jul 31 '09 at 07:13

Ferran

5

Awesome answer. Exactly what I needed. Also, great presentation to show the effect of `ignore` vs `replace` – Jonny Brooks Apr 11 '17 at 12:19
or `a.encode('ascii', 'xmlcharrefreplace')` gives `'aaa&#224;&#231;&#231;&#231;&#241;&#241;&#241;'`. – Bob Stein Apr 10 '19 at 17:22
`type(a)` is `str` in Python 3.6.8 and doesn't have any `encode()` method. – Ali Tou Aug 24 '19 at 10:16
python statement: a.encode('ascii','ignore') result: b'aaa' – Cristian May 21 '21 at 03:38

score 158 · Answer 3 · edited Jan 05 '17 at 15:48

>>> text=u'abcd'
>>> str(text)
'abcd'

If the string only contains ascii characters.
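In Python 2, `str(text)` encodes implicitly with the default encoding (ASCII unless it has been changed), which is why it fails on anything else. A minimal sketch of making that check explicit, written for Python 3; the helper name `to_ascii_bytes` is hypothetical:

```python
def to_ascii_bytes(text):
    """Return ASCII bytes for text, or None if it has non-ASCII characters."""
    try:
        return text.encode("ascii")  # strict by default: no silent data loss
    except UnicodeEncodeError:
        return None


print(to_ascii_bytes("abcd"))  # b'abcd'
print(to_ascii_bytes("£10"))   # None
```

Returning `None` here is just one design choice; raising the error or falling back to UTF-8 may suit other callers better.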

edited Jan 05 '17 at 15:48

Mauricio

answered Oct 25 '12 at 16:27

igco

21

This would only work on windows. And will break if there are non-ascii symbols. – Vanuan Jul 30 '13 at 10:50
8

This breaks if the content of the string is actually unicode, not just ascii characters in a unicode string. Don't do this, you'll get random UnicodeEncodeError exceptions all over the place. – Doug Oct 09 '13 at 07:31
12

This answer helped me. If you know that your string is ascii and you need to cast it back to a non-unicode string, this is very useful. – VedTopkar Oct 16 '14 at 16:04

score 122 · Answer 4 · edited Jan 09 '14 at 04:15

If you have a Unicode string, and you want to write this to a file, or other serialised form, you must first encode it into a particular representation that can be stored. There are several common Unicode encodings, such as UTF-16 (uses two bytes for most Unicode characters) or UTF-8 (1-4 bytes / codepoint depending on the character), etc. To convert that string into a particular encoding, you can use:

>>> s = u'£10'
>>> s.encode('utf8')
'\xc2\xa310'
>>> s.encode('utf16')
'\xff\xfe\xa3\x001\x000\x00'

This raw string of bytes can be written to a file. However, note that when reading it back, you must know what encoding it is in and decode it using that same encoding.
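The need to decode with the same encoding can be illustrated with a short sketch. It assumes Python 3, and the variable names are only for illustration:

```python
s = "£10"

utf8_bytes = s.encode("utf-8")       # b'\xc2\xa310'
utf16_bytes = s.encode("utf-16-le")  # explicit byte order, so no BOM is added

# Decoding with the encoding that was used for writing restores the text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with a different encoding "succeeds" but yields mojibake.
wrong = utf8_bytes.decode("latin-1")

print(round_trip)  # £10
print(wrong)       # Â£10
```

Latin-1 can decode any byte sequence, so the mismatch produces wrong text rather than an error, which is what makes this kind of bug easy to miss.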

When writing to files, you can get rid of this manual encode/decode process by using the codecs module. So, to open a file that encodes all Unicode strings into UTF-8, use:

import codecs
# my_unicode_string is the unicode object you want to save
with codecs.open('path/to/file.txt', 'w', 'utf8') as f:
    f.write(my_unicode_string)  # Stored on disk as UTF-8

Do note that anything else that is using these files must understand what encoding the file is in if they want to read them. If you are the only one doing the reading/writing this isn't a problem, otherwise make sure that you write in a form understandable by whatever else uses the files.

In Python 3, this form of file access is the default, and the built-in open function will take an encoding parameter and always translate to/from Unicode strings (the default string object in Python 3) for files opened in text mode.
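A minimal Python 3 sketch of that behaviour, writing to a throwaway temporary file (the path and sample text are assumptions for illustration):

```python
import os
import tempfile

text = "£10 and €5"
path = os.path.join(tempfile.mkdtemp(), "example.txt")  # throwaway location

# Text mode: str in, encoded to UTF-8 on the way to disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Text mode again: bytes on disk are decoded back to str.
with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Binary mode shows the bytes that were actually stored.
with open(path, "rb") as f:
    raw = f.read()

print(read_back)
print(raw)
```

Passing `encoding` explicitly avoids depending on the platform default, which can differ between machines.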

score 60 · Answer 5 · answered Jul 30 '09 at 15:46

Here is an example:

>>> u = u'€€€'
>>> s = u.encode('utf8')
>>> s
'\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'

answered Jul 30 '09 at 15:46

Bastien Léonard

1

Can anyone explain why, when I encode the Euro symbol to `utf8` as shown here, the result is only question marks? Here is [an image](https://i.postimg.cc/MTqgv5Qw/2019-04-04-10-18-54-Convert-a-Unicode-string-to-a-string-in-Pyth.png) of my Python, version 2.7.13. (I can encode other unicode objects like `u"Klüft"`, but not the Euros?) – Nate Anderson Apr 04 '19 at 16:20

score 12 · Answer 6 · answered Nov 28 '19 at 13:09

The file contains a unicode-escaped string:

\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0437\\u0430\\u0446\\u0438\\u044f .....\",

This worked for me:

 f = open("56ad62-json.log", encoding="utf-8")
 qq=f.readline() 

 print(qq)                          
 {"log":\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0440\\u0438\\u0437\\u0430\\u0446\\u0438\\u044f \\u043f\\u043e\\u043b\\u044c\\u0437\\u043e\\u0432\\u0430\\u0442\\u0435\\u043b\\u044f\"}

(qq.encode().decode("unicode-escape").encode().decode("unicode-escape")) 
# '{"log":"message": "Авторизация пользователя"}\n'
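The double decode above is needed because the backslashes in the log are themselves escaped. A small sketch of the mechanism, assuming Python 3 and using shortened sample strings that are not taken from the log file:

```python
# Six ASCII characters per escape: 24 characters, not 4 Cyrillic letters.
escaped = r"\u0410\u0432\u0442\u043e"

# Encoding to ASCII first makes the sketch fail loudly if real non-ASCII
# characters are present, instead of silently mangling them.
decoded = escaped.encode("ascii").decode("unicode_escape")
print(decoded)  # Авто

# Doubly escaped input needs two passes: the first pass only turns
# each "\\" into a single "\".
doubly_escaped = r"\\u0410\\u0432"
once = doubly_escaped.encode("ascii").decode("unicode_escape")
twice = once.encode("ascii").decode("unicode_escape")
print(once)   # \u0410\u0432
print(twice)  # Ав
```

When the input is valid JSON, `json.loads` is usually a safer way to resolve such escapes than the `unicode_escape` codec.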

It worked even if I only use `result.encode().decode('unicode-escape')` — Ammad Khalid, Jan 15 '20 at 02:33

JAB · Answer 7 · 2009-07-30T16:14:33.913

6

Well, if you're willing/ready to switch to Python 3 (which you may not be due to the backwards incompatibility with some Python 2 code), you don't have to do any converting; all text in Python 3 is represented with Unicode strings, which also means that there's no more usage of the u'<text>' syntax. You also have what are, in effect, strings of bytes, which are used to represent data (which may be an encoded string).

http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

(Of course, if you're currently using Python 3, then the problem is likely something to do with how you're attempting to save the text to a file.)
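A short sketch of the text/data split in Python 3 (variable names are illustrative only):

```python
text = "£10"                  # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: what actually goes on disk or the wire

print(type(text).__name__, len(text))  # str 3
print(type(data).__name__, len(data))  # bytes 4

# The two types never mix implicitly; the conversion must be spelled out.
try:
    text + data
except TypeError:
    mixed_ok = False
else:
    mixed_ok = True

print(mixed_ok)  # False
```

The length difference shows that `£` is one code point but two bytes in UTF-8.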

edited Jul 30 '09 at 16:14

answered Jul 30 '09 at 16:09

JAB

2

In Python 3 strings are Unicode strings. They are never encoded. I found the following text useful: http://www.joelonsoftware.com/articles/Unicode.html – Jul 30 '09 at 16:14
He wants to save it to a file; how does your answer help with that? – John Machin Jul 30 '09 at 16:15
@lutz: Right, I'd forgotten that Unicode is a character map rather than an encoding. @John: There isn't enough information at the moment to know what the problem with saving it is. Is he getting an error? Is he not getting any errors, but when opening the file externally he gets mojibake? Without that information, there are far too many possible solutions that could be provided. – JAB Jul 30 '09 at 16:24
@Cat: There isn't any information at the moment to know what he's got, let alone what his saving problem is. I've asked him to provide some facts -- see my answer. – John Machin Jul 30 '09 at 16:35

score 5 · Answer 8 · answered Nov 16 '20 at 14:10

There is a library called ftfy that can help with Unicode issues. It has made my life easier.

Example 1

import ftfy
print(ftfy.fix_text('uÌˆnicode'))

output -->
ünicode

Example 2 - UTF-8

import ftfy
print(ftfy.fix_text('\xe2\x80\xa2'))

output -->
•

Example 3 - Unicode code point

import ftfy
print(ftfy.fix_text(u'\u2026'))

output -->
…

https://ftfy.readthedocs.io/en/latest/

pip install ftfy

https://pypi.org/project/ftfy/

score 3 · Answer 9 · answered Dec 19 '16 at 07:59

Here is some example code:

import unicodedata    
raw_text = u"here $%6757 dfgdfg"
convert_text = unicodedata.normalize('NFKD', raw_text).encode('ascii','ignore')

answered Dec 19 '16 at 07:59

Gihan Chathuranga

how this answer is different from the accepted answer ? – sgauri Jun 30 '18 at 09:51

score 2 · Answer 10 · answered Nov 05 '19 at 20:40

No answer worked for my case, where I had a string variable containing Unicode escape sequences, and none of the encode/decode approaches explained here did the job.

If I do in a Terminal

echo "no me llama mucho la atenci\u00f3n"

or

python3
>>> print("no me llama mucho la atenci\u00f3n")

The output is correct:

output: no me llama mucho la atención

But working with scripts loading this string variable didn't work.

This is what worked in my case, in case it helps anybody:

import json
# Raw string: the variable holds a literal backslash-u escape, as when loaded by a script
string_to_convert = r"no me llama mucho la atenci\u00f3n"
print(json.dumps(json.loads(r'"%s"' % string_to_convert), ensure_ascii=False))
output: "no me llama mucho la atención"

score 1 · Answer 11 · answered Jun 08 '22 at 10:12

This is my function

import unicodedata
def unicode_to_ascii(note):
    str_map = {'Š' : 'S', 'š' : 's', 'Đ' : 'D', 'đ' : 'd', 'Ž' : 'Z', 'ž' : 'z', 'Č' : 'C', 'č' : 'c', 'Ć' : 'C', 'ć' : 'c', 'À' : 'A', 'Á' : 'A', 'Â' : 'A', 'Ã' : 'A', 'Ä' : 'A', 'Å' : 'A', 'Æ' : 'A', 'Ç' : 'C', 'È' : 'E', 'É' : 'E', 'Ê' : 'E', 'Ë' : 'E', 'Ì' : 'I', 'Í' : 'I', 'Î' : 'I', 'Ï' : 'I', 'Ñ' : 'N', 'Ò' : 'O', 'Ó' : 'O', 'Ô' : 'O', 'Õ' : 'O', 'Ö' : 'O', 'Ø' : 'O', 'Ù' : 'U', 'Ú' : 'U', 'Û' : 'U', 'Ü' : 'U', 'Ý' : 'Y', 'Þ' : 'B', 'ß' : 'Ss', 'à' : 'a', 'á' : 'a', 'â' : 'a', 'ã' : 'a', 'ä' : 'a', 'å' : 'a', 'æ' : 'a', 'ç' : 'c', 'è' : 'e', 'é' : 'e', 'ê' : 'e', 'ë' : 'e', 'ì' : 'i', 'í' : 'i', 'î' : 'i', 'ï' : 'i', 'ð' : 'o', 'ñ' : 'n', 'ò' : 'o', 'ó' : 'o', 'ô' : 'o', 'õ' : 'o', 'ö' : 'o', 'ø' : 'o', 'ù' : 'u', 'ú' : 'u', 'û' : 'u', 'ý' : 'y', 'ý' : 'y', 'þ' : 'b', 'ÿ' : 'y', 'Ŕ' : 'R', 'ŕ' : 'r'}
    for key, value in str_map.items():
        note = note.replace(key, value)
    asciidata = unicodedata.normalize('NFKD', note).encode('ascii', 'ignore')
    return asciidata.decode('UTF-8')
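A more compact variant of the same idea can be sketched with `str.translate`. This assumes Python 3; the table `EXTRA` and the name `to_ascii` are hypothetical, and the table is deliberately short because only characters that NFKD cannot decompose need an explicit entry.

```python
import unicodedata

# Hypothetical table: characters with no decomposition into base letter + accent
EXTRA = str.maketrans({
    "ß": "ss", "Đ": "D", "đ": "d",
    "Ø": "O", "ø": "o", "Æ": "AE", "æ": "ae",
})


def to_ascii(text):
    """Map special cases by hand, then strip accents via NFKD."""
    text = text.translate(EXTRA)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Fuß, Rødgrød, Čačak"))  # Fuss, Rodgrod, Cacak
```

Anything that is neither in the table nor decomposable is still dropped, so the result remains a lossy approximation.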

score 1 · Answer 12 · answered Jun 30 '22 at 12:38

I have made the following function which lets you control what to keep according to the General_Category_Values in Unicode (https://www.unicode.org/reports/tr44/#General_Category_Values)

def FormatToNameList(name_str):
    import unicodedata
    clean_str = ''
    for c in name_str:
        if unicodedata.category(c) in ['Lu','Ll']:
            clean_str += c.lower()
            print('normal letter: ',c)
        elif unicodedata.category(c) in ['Lt','Lm','Lo']:
            clean_str += c
            print('special letter: ',c)
        elif unicodedata.category(c) in ['Nd']:
            clean_str += c
            print('normal number: ',c)
        elif unicodedata.category(c) in ['Nl','No']:
            clean_str += c
            print('special number: ',c)
        elif unicodedata.category(c) in ['Cc','Sm','Zs','Zl','Zp','Pc','Pd','Ps','Pe','Pi','Pf','Po']:
            clean_str += ' '
            print('space or symbol: ',c)
        else:
            print('other: ',' : ',c,' unicodedata.category: ',unicodedata.category(c))    
    name_list = clean_str.split(' ')
    return clean_str, name_list

if __name__ == '__main__':
    u = 'some3^?"Weirdstr ' + chr(231) + chr(0x0af4)
    clean_str, name_list = FormatToNameList(u)
    print(clean_str)
    print(name_list)

See also https://docs.python.org/3/howto/unicode.html
