Delete weird ANSI character and convert accented ones using Python

Question

I've downloaded a bunch of Spanish tweets using the Twitter API, but some of them have strange ANSI characters that I don't want there. I have around 18000 files and I want to remove those characters. I have all my files encoded as UTF-8. For example:

b'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy.'

If they are accented characters (we have plenty in Spanish), I want to replace each accented letter with its non-accented version. That's because I'm doing some text mining analysis afterwards and I want to unify the words, since some people don't use accents. I think the b means the value is in bytes mode.

For the case above, if I run the following in Python:

print(u'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy con @Colegas')

I get this in the terminal:

Me quedo con una frase de nuestra reuniÃ³n de hoy con @Colegas

Which I don't like, because Ã³ is not an accent used in Spanish. The character should be ó. I don't get why it isn't coming out right. I would also like the b at the beginning of the files to disappear. To encode the files I used the following:

f.write(str(FILE.encode('utf-8','strict')))

There I create my files from some JSON in UTF-8, which contains a lot of keys for each tweet. Maybe I should change that, or I'm doing something wrong there.

In some cases there's also a problem when trying to print the characters in the Python terminal. For instance:

print(u'\uD83D\uDC1F')

I think that's because Python can't represent those characters (� in the example above). Is that so? I would also like to remove them.

Sorry if there are any English mistakes, and feel free to ask if something is not clear.

Thanks in advance.

EDIT: I'm using Python 3.4

score 1 · Accepted Answer · edited May 23 '17 at 12:30

You are mixing apples and oranges. b'reuni\xc3\xb3n' is the UTF-8 encoding of u'reuni\u00f3n' which of course is reunión in human-readable format.

>>> print(b'reuni\xc3\xb3n'.decode('utf-8'))
reunión
>>> ascii(b'reuni\xc3\xb3n'.decode('utf-8'))
"'reuni\\xf3n'"

There is no "ANSI" here (it's a misnomer anyway; commonly it is used to refer to Windows character encodings, but not necessarily the one you expect).
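To make the mix-up concrete, here is a minimal Python 3 sketch of where the Ã³ in the question comes from. The variable names are illustrative only. Writing the byte escapes inside a text literal produces two separate characters (U+00C3 and U+00B3) instead of the single character ó. The Latin-1 round trip at the end is just a repair trick for this particular mistake, not something to apply blindly.

```python
# The two UTF-8 bytes of "ó", wrongly written as two *text* characters.
mojibake = 'reuni\xc3\xb3n'          # displays as reuniÃ³n

# The same escapes in a bytes literal, decoded as UTF-8: a single "ó".
correct = b'reuni\xc3\xb3n'.decode('utf-8')

# Repair: Latin-1 maps each of those characters back to exactly one byte,
# and the resulting bytes can then be decoded as UTF-8.
repaired = mojibake.encode('latin-1').decode('utf-8')
```

After running this, `repaired` and `correct` are the same string, and `mojibake` is one character longer than either of them.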

As for how to remove the accents from accented characters, the short version is to normalize to a decomposed Unicode form (NFD, or NFKD as in the snippet below, which additionally applies compatibility decomposition), then discard every code point that has a nonzero combining class. This is covered e.g. in "What is the best way to remove accents in a Python unicode string?"; to make this answer self-contained, here is the gist of one of the answers to that question, but do read all of them, if only to decide which best suits your use case.

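import unicodedata

input_str = u"reuni\u00f3n"  # example input; any text (str) value works
# Decompose each character, then drop the combining marks (the accents).
stripped = u"".join([c for c in unicodedata.normalize('NFKD', input_str)
    if not unicodedata.combining(c)])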
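Wrapped up as reusable functions, the idea might look like the sketch below. The names `strip_accents` and `drop_symbols` are made up for this example, and the choice of categories to drop is an assumption: `So` covers most emoji and `Cs` covers lone surrogates such as the `\uD83D\uDC1F` pair from the question, but other categories may matter for your data.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks (accents, tildes, diaereses) removed."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))


def drop_symbols(text, unwanted=('So', 'Cs')):
    """Drop characters whose Unicode general category is in `unwanted`.

    'So' (Symbol, other) covers most emoji; 'Cs' covers lone surrogates.
    """
    return ''.join(c for c in text if unicodedata.category(c) not in unwanted)


# Decode first, then clean the resulting text.
tweet = b'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy.'.decode('utf-8')
plain = strip_accents(tweet)
no_emoji = drop_symbols('pez \U0001F41F')
```

Note that this also turns ñ into n (so año becomes ano), which may or may not be what you want for Spanish text mining.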

Thanks. I got the bytes and decoding stuff all mixed up. I get it now. I tried the Unidecode package in python and seems to be doing the same and I find it easier. — Ignacio, May 17 '15 at 15:06

Roland Smith · Answer 2 · 2015-05-17T13:48:01.177

One of the patterns for handling incoming text (in the form of bytes) in Python 3 is to decode it as soon as it is received.

So you get this from Twitter;

In [1]: tweetbytes = b'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy.'

And you do;

In [2]: tweet = tweetbytes.decode('utf-8')

Remember the acronym BADTIE; Bytes Are Decoded, Text Is Encoded.

Now it is text;

In [3]: type(tweet)
Out[3]: str

And you can use it as such;

In [4]: print(tweet)
Me quedo con una frase de nuestra reunión de hoy.
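The same rule applies when writing files. The following is a rough sketch, assuming each existing file holds exactly one `b'...'` literal as produced by the `f.write(str(FILE.encode(...)))` line in the question; the temporary path and variable names are placeholders. Calling `str()` on bytes stores their repr, which is where the leading b comes from. Since the files cannot be downloaded again, `ast.literal_eval` can turn that stored literal back into bytes, which are then decoded.

```python
import ast
import os
import tempfile

text = 'reuni\u00f3n'                                  # text (str)
path = os.path.join(tempfile.mkdtemp(), 'tweet.txt')   # throwaway location

# What the question did: str() of a bytes object writes its repr, b'...'.
with open(path, 'w', encoding='utf-8') as f:
    f.write(str(text.encode('utf-8', 'strict')))
with open(path, encoding='utf-8') as f:
    stored = f.read()          # the file now literally contains b'reuni\xc3\xb3n'

# Recovery: parse the stored bytes literal back into bytes, then decode.
recovered = ast.literal_eval(stored).decode('utf-8')

# Simpler from the start: write text and let open() do the encoding.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with open(path, encoding='utf-8') as f:
    reread = f.read()
```

Both `recovered` and `reread` end up equal to the original text, with no b prefix and no escaped bytes.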

Dirty_Fox · Answer 3 · 2015-05-17T11:41:21.810

-2

First of all: you need to be 100% sure which encoding Twitter uses for those characters. If you are sure that it is ANSI (for Spanish the encoding will normally be Latin-1), then you need to apply this function to everything you get from Twitter:

a = getStufFromTwitter() #you parse twitter 
myStr = a.encode('Latin-1')

The .encode('Latin-1') call will tell Python that everything you are taking from the outside is written in ANSI and that it should convert it to Unicode.

Then, whenever you want to reuse myStr in any part of your program (especially if you want to write it somewhere), you have to use the decode function. In your case that will be:

with open('myfile.txt','w') as f:
    f.write(myStr.decode('UTF-8'))

This should work. However, it would be easier to help you if we could see more of the code. Python has some very nasty quirks here. Are you using Python 2.7? If so, add the following at the beginning of every script:

from __future__ import unicode_literals

Once again, it is a very tricky part of Python.

edited May 17 '15 at 11:41

answered May 17 '15 at 11:11


The thing is I can't get the tweets again from Twitter. I have all of them now, and due to Twitter API limitations I can't get them again. I have those ANSI characters delimited with \u. I just installed Unidecode 0.04.17 and it seems to work fine, but I'm still not confident about that because I don't have much knowledge about encodings and all that stuff. – Ignacio May 17 '15 at 11:31
Then try .decode('Latin-1') before you put them in the file. Or .decode('utf-8'). Hope this works; otherwise, without more information, it will be difficult to help. – Dirty_Fox May 17 '15 at 11:40
@Dirty_Fox You cannot `encode` bytes in Python 3. The acronym BADTIE will help you remember; Bytes Are Decoded, Text is Encoded. – Roland Smith May 17 '15 at 13:36
