What I'm trying to do: I fetch a list of URIs from a database, download each page, remove the stopwords, count how often each word appears on the page, and then try to save the result in MongoDB.
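Roughly, a simplified sketch of what the pipeline looks like (not my actual source; Python 2, and the stopword list and database/collection names are just placeholders):

    # -*- coding: utf-8 -*-
    import re
    import urllib2
    from collections import Counter
    from pymongo import MongoClient

    STOPWORDS = set([u'the', u'a', u'and'])   # placeholder stopword list
    WORD_RE = re.compile(r'\w+', re.UNICODE)  # \w keeps accented letters in unicode text

    def count_words(uri):
        raw = urllib2.urlopen(uri).read()      # raw bytes (str in Python 2)
        text = raw.decode('utf-8', 'replace')  # decode to unicode before counting
        words = WORD_RE.findall(text.lower())
        return Counter(w for w in words if w not in STOPWORDS)

    coll = MongoClient()['scraper']['word_counts']  # placeholder db/collection
    uri = 'http://example.com/'
    coll.insert_one({'uri': uri, 'counts': dict(count_words(uri))})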
The Problem: when I try to save the result in the database I get the error bson.errors.InvalidDocument: the document must be a valid UTF-8.
It appears to be related to byte sequences like '\xc3someotherstrangewords' and '\xe2something'. When I process the webpages I try to remove the punctuation, but I can't strip the accents, because that would turn words into wrong ones.
What I already tried:
- identifying the character encoding through the headers of the webpage;
- using chardet to guess the encoding (see the sketch after this list for how the two can be combined);
- using re.compile(r"[^a-zA-Z]") and/or unicode(variable, 'ascii', 'ignore'), which isn't good for non-English languages because it strips the accents.
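For reference, a sketch of how the first two approaches could be combined (assuming Python 2 and urllib2; chardet is used only when the Content-Type header declares no charset):

    import urllib2
    import chardet

    def fetch_unicode(uri):
        resp = urllib2.urlopen(uri)
        raw = resp.read()
        # Prefer the charset declared in the Content-Type header...
        charset = resp.headers.getparam('charset')
        # ...and fall back to chardet's guess when there is none.
        if not charset:
            charset = chardet.detect(raw)['encoding'] or 'utf-8'
        # 'replace' keeps going even if a few bytes are broken and
        # guarantees the result is valid unicode for BSON.
        return raw.decode(charset, 'replace')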
What I want to know is:
does anyone know how to identify the encoding of these chars and decode them into the right words?
E.g. the webpage gives me '\xe2' and I want it translated to 'â'.
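If it helps, my guess (I'm not sure about this) is that the bytes come from pages in different encodings: '\xe2' is 'â' in Latin-1/ISO-8859-1, while a leading '\xc3' usually starts a two-byte UTF-8 sequence, so each page would need to be decoded with its own codec. In a Python 2 shell:

    >>> print '\xe2'.decode('latin-1')      # one byte in Latin-1
    â
    >>> print '\xc3\xa2'.decode('utf-8')    # the same letter as two UTF-8 bytes
    â
    >>> u'\xe2' == '\xc3\xa2'.decode('utf-8')
    True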
(English isn't my first language, so forgive me.) EDIT: if anyone wants to see the source code