Unknown characters in scraped data

Question

I am using pandas to import a CSV file containing a table of rows and columns, mostly text. Some of the text contains the characters shown below, and some of them repeat multiple times. I'm not sure what they are or how to handle them. I've tried multiple encodings; the characters change but don't go away. Is there a script, process, or encoding that cleans up these kinds of characters?

ENCODING UTF-8
.billion stored in the world‚Äôs largest database bought for ¬£6, according to an investigation
.Caused, the NMBS said, by a data worker ‚Äúclicking on the wrong button‚Äù.'
.there‚Äôs a good chance that you‚Äôre one of, one of the nation‚Äôs three major credit reporting agencies.'

ENCODING CP1252
.billion stored in the worldâ€šÃ„Ã´s largest database bought for Â¬Â£6, according to an investigation
.Caused, the NMBS said, by a data worker â€šÃ„Ãºclicking on the wrong buttonâ€šÃ„Ã¹.
.thereâ€šÃ„Ã´s a good chance that youâ€šÃ„Ã´re one of, one of the nationâ€šÃ„Ã´s three major credit reporting agencies.'

`‚Äô` is UTF-8 for the curly apostrophe. We will need to see more about how you open the file in question, and whether the CSV displays correctly in Excel/LibreOffice. If you can provide a more complete example, it would help. – JonSG, Jul 21 '21 at 17:33
These are the characters I see in Excel in some sentences: âˆšâ€¢ ˆšâ€¢ â€šÃ„Ã¬ â€šÃ„Ãº â€šÃ„Ã¹ â€šÃ„Ã Â¬Â£6 âˆšÂ§ â€šÃ„Ã´. In pandas I just use pd.read_csv(filelocation) with no encoding argument. I have tried several encodings to see if the text gets cleaned up, but that only changes the characters; the output for encoding=utf-8 and encoding=1252 is shown above. – jon rios, Jul 21 '21 at 19:07
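
The two samples above fit a common pattern: UTF-8 bytes decoded with the wrong single-byte code page. The following is a minimal sketch, assuming the intermediate code page was Mac OS Roman (an inference from the samples, not something stated in the thread). Under that assumption it reproduces both the "ENCODING UTF-8" and the "ENCODING CP1252" output from the intended text:

```python
# Assumption: the text was originally UTF-8, was decoded as Mac OS Roman
# somewhere along the way, and was then saved as UTF-8 again.
intended = "world’s"

# Step 1: UTF-8 bytes misread as Mac OS Roman.
# This matches the "ENCODING UTF-8" sample.
stage1 = intended.encode("utf-8").decode("mac_roman")
print(stage1)  # world‚Äôs

# Step 2: the damaged text, stored as UTF-8 and read as CP1252.
# This matches the "ENCODING CP1252" sample.
stage2 = stage1.encode("utf-8").decode("cp1252")
print(stage2)  # worldâ€šÃ„Ã´s
```

If this reading is right, the damage is already in the file, which would explain why changing the `encoding` argument only changes how the characters look.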

jon rios · Answer 1 · 2021-07-21T23:13:52.800

I ended up finding all the individual characters and replacing them with nothing. The sentences seem to read fine; a couple of apostrophes are missing, but the text is still readable.

spec_chars = ['Ä', 'ù', 'ú', 'ì', 'ô', 'ˆ', 'š', '€', '¢', 'Ã', '„', '¬', 'º', '¹', 'Â', '£', '§', '´']

# Strip each unwanted character from a single string
for ch in spec_chars:
    mytext = mytext.replace(ch, "")

# Or over the entire DataFrame

df.replace(regex='[ÄùÄúÄìÄôÄùˆš€¢Ã„¬º¹Â£§´]', value="", inplace=True)
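
Deleting the characters also deletes the apostrophes, quotes, and pound signs they stood for. An alternative is to reverse the damage instead. The following is a minimal sketch, assuming the "ENCODING UTF-8" text is UTF-8 that was misread as Mac OS Roman (an inference from the samples in the question); `repair_mojibake` is a hypothetical helper name, not a pandas function:

```python
def repair_mojibake(text):
    """Undo one round of 'UTF-8 read as Mac OS Roman' damage, if present."""
    if not isinstance(text, str):
        return text  # leave NaN and numeric cells untouched
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("mac_roman").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage, so keep the text as is


print(repair_mojibake("the world‚Äôs largest database bought for ¬£6"))
# the world’s largest database bought for £6
```

On a DataFrame read with `encoding="utf-8"`, this could be applied per text column, for example `df[col] = df[col].map(repair_mojibake)`. A cell that mixes damaged and intact non-ASCII characters will fail the round trip and be returned unchanged, so some cells may still need manual cleanup.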

DonCarleone · Answer 2 · 2021-07-21T17:43:27.100

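You can keep only the words that contain no foreign characters:

from string import ascii_letters, punctuation

# Placeholder input: replace with your own list of words
words = ["world‚Äôs", "largest", "database"]
# A word may contain ASCII letters and punctuation only
allowed = set(ascii_letters + punctuation)

# Keep the words made up entirely of allowed characters
output = [word for word in words if all(letter in allowed for letter in word)]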

See Python - remove elements (foreign characters) from list

edited Jul 21 '21 at 17:43

answered Jul 21 '21 at 17:19


I have about 10 columns and 600 rows with full sentences. Does that mean I need to make a list of every word in the entire table? – jon rios Jul 21 '21 at 19:11
Presumably. I mean, you're processing unstructured text; isn't that part of the process? You can try `NLTK` too; you may find something helpful there. – DonCarleone Jul 21 '21 at 19:34
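
On the question of building a word list by hand: one reading of this answer is to split each sentence on whitespace and filter the resulting words, so no list has to be prepared in advance. The following is a minimal sketch under that assumption. `keep_ascii_words` is a hypothetical helper name, and unlike the answer's `allowed` set, this version also admits digits:

```python
from string import ascii_letters, digits, punctuation

# Assumption: digits are allowed too, so plain numbers survive the filter.
ALLOWED = set(ascii_letters + digits + punctuation)


def keep_ascii_words(sentence):
    """Drop every whitespace-separated word containing a disallowed character."""
    words = sentence.split()
    return " ".join(w for w in words if all(ch in ALLOWED for ch in w))


print(keep_ascii_words("bought for ¬£6, according to an investigation"))
# bought for according to an investigation
```

This drops whole words, so `¬£6,` disappears along with the price it carried. For a column of string cells it could be applied with `df[col].map(keep_ascii_words)`.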
