I've downloaded a bunch of Spanish tweets using the Twitter API, but some of them have strange ANSI characters that I don't want there. I have around 18000 files and I want to remove those characters. I have all my files encoded as UTF-8. For example:
b'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy.'
If they are accented characters (we have plenty in Spanish), I want to replace each accented letter with its non-accented version. That's because afterwards I'm doing some text mining analysis and I want to unify the words, since some people may not use accents.
That b prefix means it is in byte mode, I think.
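To make the goal concrete, here is a minimal sketch of the accent removal described above, assuming the tweet is available as UTF-8 bytes like the example. The helper name strip_accents is hypothetical; it relies only on the standard unicodedata module.

```python
import unicodedata

def strip_accents(text):
    # NFD decomposition splits an accented letter into the base letter
    # plus a combining mark, e.g. 'ó' becomes 'o' + U+0301.
    decomposed = unicodedata.normalize('NFD', text)
    # Drop the combining marks (Unicode category 'Mn'), keep everything else.
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

# The example tweet as UTF-8 bytes; decoding turns it into a normal str.
raw = b'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy.'
text = raw.decode('utf-8')

plain = strip_accents(text)
print(plain)
```

Note that this approach also turns ñ into n and ü into u, which may or may not be wanted for Spanish text mining.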
Taking the example above, I run the following in Python:
print(u'Me quedo con una frase de nuestra reuni\xc3\xb3n de hoy con @Colegas')
And I get this in the terminal:
Me quedo con una frase de nuestra reunión de hoy con @Colegas
I don't like this because ó is not an accent used in Spanish. The character should be ó. I don't get why it is not getting it right.
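A likely explanation, sketched below: inside a str literal, \xc3 and \xb3 are read as the two code points U+00C3 (Ã) and U+00B3 (³), not as the two UTF-8 bytes of ó. Assuming that is what happened, re-encoding as Latin-1 recovers the original bytes, which can then be decoded as UTF-8.

```python
# In a str literal each \xNN escape is a code point, so this string
# really contains 'Ã' (U+00C3) followed by '³' (U+00B3).
garbled = u'reuni\xc3\xb3n'

# Latin-1 maps code points 0-255 back to the same byte values,
# giving b'reuni\xc3\xb3n', which is valid UTF-8 for 'reunión'.
fixed = garbled.encode('latin-1').decode('utf-8')

print(len(garbled), len(fixed))
print(fixed == 'reuni\u00f3n')
```

This only works for strings that were garbled in exactly this way; text that is already correct may raise UnicodeEncodeError or UnicodeDecodeError.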
I would also like the b at the beginning of the files to disappear.
To encode the files I used the following:
f.write(str(FILE.encode('utf-8','strict')))
That is where I create my files from some UTF-8 JSON that contains a lot of keys for each tweet. Maybe I should change it, or maybe I'm doing something wrong there.
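A sketch of what that write call probably does: in Python 3, calling str() on a bytes object returns its repr, so the b'...' wrapper and the \x escapes end up in the file as literal text. The variable names and the temporary file path below are hypothetical, chosen only for illustration.

```python
import ast
import os
import tempfile

tweet_text = 'reuni\u00f3n de hoy'  # hypothetical tweet text, already a str

# str() of bytes gives the repr, so this is the text "b'reuni\\xc3\\xb3n de hoy'".
as_written = str(tweet_text.encode('utf-8', 'strict'))

# Alternative: open the file in text mode with an explicit encoding
# and write the str directly, without calling encode() or str().
path = os.path.join(tempfile.mkdtemp(), 'tweet.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(tweet_text)
with open(path, encoding='utf-8') as f:
    read_back = f.read()

# Files already written in the b'...' form can be recovered by parsing
# the literal back into bytes and decoding it.
recovered = ast.literal_eval(as_written).decode('utf-8')
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for this kind of recovery across many files.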
In some cases there's also a problem when trying to print the characters in the Python terminal. For instance:
print(u'\uD83D\uDC1F')
I think that's because Python can't represent those characters (� in the example above). Is that so? I would also like to remove them.
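For context, \uD83D\uDC1F is a UTF-16 surrogate pair, the form JSON uses to escape characters outside the Basic Multilingual Plane (here U+1F41F, a fish emoji). A minimal sketch, assuming the goal is to drop such characters entirely; the helper name remove_non_bmp is hypothetical.

```python
def remove_non_bmp(text):
    # Keep only characters in the Basic Multilingual Plane, and also
    # drop lone surrogates (U+D800..U+DFFF), which cannot be encoded.
    return ''.join(
        ch for ch in text
        if ord(ch) <= 0xFFFF and not 0xD800 <= ord(ch) <= 0xDFFF
    )

pair = u'\uD83D\uDC1F'  # two surrogate code points, not one character

# Round-tripping through UTF-16 with 'surrogatepass' joins the pair
# into the single code point it stands for.
combined = pair.encode('utf-16', 'surrogatepass').decode('utf-16')

cleaned = remove_non_bmp('Hola ' + combined + ' mundo')
print(cleaned)
```

Parsing the tweet with json.loads instead of pasting the escapes into a str literal should combine such pairs automatically.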
Sorry if there are any English mistakes, and feel free to ask if something is not clear.
Thanks in advance.
EDIT: I'm using Python 3.4