Encoding issues while reading a CSV in pandas

Question

I am reading a CSV file with a single column that contains text data. When I hit an encoding error because the file was not in UTF-8, I tried the following two solutions:

Solution 1:

import pandas as pd
df = pd.read_csv("data_encoded.csv", encoding='latin-1')

Solution 2:

I explicitly converted the file to UTF-8 and then read it with the default encoding:
df = pd.read_csv("data_encoded.csv")

Both solutions fixed the error, but I am now getting garbage values. For example:

me pretty (changed to)=> me\\r\\rpretty

I noticed "\r" appended to most of the words when I tokenized them. Is there a Pythonic way to remove these?

I have implemented solutions like:

re.sub
filters based on ("\\r")
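
For reference, a minimal sketch of the kind of cleanup meant above. It assumes the unwanted characters are either real carriage returns or the literal two-character sequence backslash + r; the helper name `strip_carriage_returns` and the sample strings are hypothetical, not taken from the actual file.

```python
import re

# Matches runs of real carriage returns (\r) or of the literal
# two-character sequence backslash + "r" (an escaped carriage return).
CR_PATTERN = re.compile(r"(?:\r|\\r)+")

def strip_carriage_returns(text):
    """Collapse each run of carriage returns into one space and trim the ends."""
    return CR_PATTERN.sub(" ", text).strip()

# Hypothetical sample values resembling the example above.
samples = ["me\r\rpretty", "me\\r\\rpretty", "word\r", "clean"]
cleaned = [strip_carriage_returns(s) for s in samples]
print(cleaned)
```

On a DataFrame the same pattern could probably be applied with `df["text"].str.replace(r"(?:\r|\\r)+", " ", regex=True)`, where `"text"` stands in for the real column name. This only removes the characters after the fact; it does not explain where they come from.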

I am looking for a way to prevent the garbage values from forming in the first place. Any suggestions would be helpful.

Please provide a sample snippet of the (anonymized) CSV input file. `\\r` should not be related to the encoding itself; it is more likely an issue of file origin, since line endings differ depending on the OS. In addition, `\\r` looks like an escaped `\r`, which may point to faulty file reading. Does reading the file with the standard `csv` module and the appropriate `newline=''` option produce the desired output (see https://docs.python.org/3/library/csv.html and https://stackoverflow.com/q/3191528/3991125)? Also have a look at the `chardet` module for automatic detection of the file encoding. — albert, Aug 13 '18 at 20:57
You can refer to this answer `https://stackoverflow.com/q/3191528/4662041` — Sheshnath, Aug 13 '18 at 21:00
https://stackoverflow.com/questions/34550120/pandas-escape-carriage-return-in-to-csv https://stackoverflow.com/questions/3191528/csv-in-python-adding-an-extra-carriage-return-on-windows https://github.com/pandas-dev/pandas/issues/3501 — albert, Aug 13 '18 at 21:03
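
The check suggested in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample bytes are made up (a Latin-1 file with bare carriage returns inside a quoted field and Windows line endings), and the asker's real file may look different. It inspects the raw bytes first, then reads the file with the standard `csv` module and `newline=''`.

```python
import csv
import os
import tempfile

# Hypothetical sample data: Latin-1 text, bare carriage returns inside a
# quoted field, and \r\n between rows. Not the asker's actual file.
raw = 'text\r\n"me\r\rpretty"\r\n"caf\xe9"\r\n'.encode("latin-1")

with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(raw)
    path = tmp.name

# Step 1: look at the raw bytes to see whether b"\r" is already in the file.
with open(path, "rb") as fh:
    head = fh.read(40)
print(head)

# Step 2: newline='' disables newline translation, so the csv module
# consumes the row terminators itself and leaves embedded \r untouched.
with open(path, newline="", encoding="latin-1") as fh:
    rows = list(csv.reader(fh))

os.remove(path)
print(rows)
```

If the carriage returns show up in the raw bytes, they are part of the data rather than a side effect of the chosen encoding, so no `encoding` argument will remove them; they would have to be cleaned at the source or after reading.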


0 Answers