Getting rid of unicode characters in a list

Question

I've tried str() and x.encode('UTF8'). Is there a quick and easy way to remove the unicode chars? My list looks like this:

mcd = [u'Chicken saut\xc3\xa9ed potatoes',  'Roasted lamb with mash potatoes', 'Rabbit casserole with tarragon, mushrooms and dijon mustard sauce. Served with mash potatoes']

The reason I am trying to get rid of the u's is becasue I'd like to copy this data onto a CSV file. It gives me an error like the one below when i try to do so...

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-10: ordinal not in range(128)

I thought it would be easier just to remove unicode altogether.

Thanks in advance!

Why would you want to? For almost every imaginable scenario, you should want and need your strings to be Unicode. What are you actually trying to accomplish? — tripleee, Sep 30 '15 at 14:02
`u'saut\xc3\xa9ed'` looks problematic, though. You are probably reading in the input incorrectly in the first place. I guess that's what you should fix instead. It should apparently be `u'saut\xe9ed'`. — tripleee, Sep 30 '15 at 14:07
By definition all characters in that string are unicode characters. But I don't think you want to remove *all* characters. Which ones did you want to keep? — flodin, Sep 30 '15 at 14:19
Using mcd = [u'Chicken saut\xc3\xa9ed potatoes', 'Roasted lamb with mash potatoes', 'Rabbit casserole with tarragon, mushrooms and dijon mustard sauce. Served with mash potatoes'] new = [str(m) for m in mcd] for m,n in zip(mcd,new): # compare before and after print type(m), type(n) give me and error: File "combined.py", line 31, in new = [str(m) for m in mcd] UnicodeEncodeError: 'ascii' codec can't encode characters in position 119-120: ordinal not in range(128). — MarkJ, Sep 30 '15 at 14:23
@tripleee I need to remove the unicode becasue of the error I'm getting. My post has been updated. As for the u'saut\xc3\xa9ed', this data is off of a website I scrapped. — MarkJ, Sep 30 '15 at 14:38

areuexperienced · Answer 1 · 2015-09-30T14:08:50.997

3

This works for me:

mcd = [u'Chicken saut\xc3\xa9ed potatoes',  'Roasted lamb with mash potatoes', 'Rabbit casserole with tarragon, mushrooms and dijon mustard sauce. Served with mash potatoes']

new = [str(m) for m in mcd]

for m,n in zip(mcd,new): # compare before and after
    print type(m), type(n)

OUT:

<type 'unicode'> <type 'str'>
<type 'str'> <type 'str'>
<type 'str'> <type 'str'>

If the above doesn't work (see convo in comments):

new = [m.encode('utf-8') for m in mcd]

edited Sep 30 '15 at 14:08

answered Sep 30 '15 at 13:57

areuexperienced

1,991
2
17
27

what version of python are you using it shows error for me `UnicodeEncodeError: 'ascii' codec can't encode characters in position 12-13: ordinal not in range(128)` i am using 2.7 – The6thSense Sep 30 '15 at 13:59
`sys.version` `'2.7.6 (default, Jan 3 2014, 16:42:21) \n[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)]'` via iPython Notebook – areuexperienced Sep 30 '15 at 14:01
1

I get the same error as Vignesh Kalai, I am using 2.7 as well. – Joe T. Boka Sep 30 '15 at 14:03
1

I can confirm I am getting the same error via a normal python2.7 shell but not via iPython Notebook. Weird. – areuexperienced Sep 30 '15 at 14:05
I just tried it using iPython Notebook with Python 3 and it works – Joe T. Boka Sep 30 '15 at 14:06
I tried it with both 2.7 and 3.4. Getting the same error, – MarkJ Sep 30 '15 at 14:40
Did you try using `.encode('utf-8')` rather than `str()`? – areuexperienced Sep 30 '15 at 15:12

score 1 · Answer 2 · edited May 23 '17 at 12:06

The problem is probably that you are pressing enter instead of printing the result. This calls repr instead of str. Quoting the doc:

In the interactive interpreter, the output string is enclosed in quotes and special characters are escaped with backslashes. While this might sometimes look different from the input (the enclosing quotes could change), the two strings are equivalent. reference

Let me show you:

In [1]: mcd = [u'Chicken saut\xc3\xa9ed potatoes',  'Roasted lamb with mash potatoes', 'Rabbit casserole with tarragon, mushrooms and dijon mustard sauce. Served with mash potatoes']

In [2]: mcd[0]
Out[2]: u'Chicken saut\xc3\xa9ed potatoes'

In [3]: print repr(mcd[0])
u'Chicken saut\xc3\xa9ed potatoes'

In [4]: print mcd[0]  # Here will use my current OS encoding, i think utf8 in my case
Chicken sautÃ©ed potatoes

In [5]: print mcd[0].encode('utf8')  # yes! i was right
Chicken sautÃ©ed potatoes

You should choose the encoding type first, i think in this case you have to use latin1:

In [20]: print mcd[0].encode('latin1')
Chicken sautéed potatoes

Hope to have helped.

Edit: I hadn't seen the edit of the question, if you want to replace the characters, check this answer

tripleee · Answer 3 · 2015-10-05T05:21:52.507

If the strings you have obtained are the result of web site scraping, it appears that the site you got them off has an incorrect encoding setting.

It is fairly common for sites to specify charset=utf-8 and then have the site's content actually in some other character set (windows-1252 in particular) or vice versa. There is no simple, universal workaround for this phenomenon (also known as mojibake).

You might want to try with different scraping libraries -- most have some sort of tactic for identifying and coping with this scenario, but they have different success rates in different scenarios. If you are using BeautifulSoup, you might want to experiment with different parameters to the chardet back end.

Of course, if you only care about correctly scraping a single site, you can hard-code an override for the site's claimed character encoding.

Your question as such doesn't make much sense. It's not really clear what you are trying to accomplish. u'Chicken and sauted potatoes' is no more correct and only marginally less unappealing than u'Chicken and sautÃ©ed potatoes' (and in some ways more unappealing, because you can't tell that there was an attempt to make it right, although it wasn't competently executed).

If you get an encoding error because you are feeding Unicode to a file handle with an ASCII encoding, the correct solution for that is to specify an encoding other than ASCII (commonly, UTF-8) when opening the file for writing.

Getting rid of unicode characters in a list

3 Answers3