
I'm trying to read and write a dataframe to a pipe-delimited file. Some of the characters are non-Roman letters (´, ç, ñ, etc.), and writing fails when I try to encode them as ASCII.

df = pd.read_csv('filename.txt', sep='|', encoding='utf-8')
<do stuff>
newdf.to_csv('output.txt', sep='|', index=False, encoding='ascii')

The traceback:

  File "<ipython-input-63-ae528ab37b8f>", line 21, in <module>
    newdf.to_csv(filename,sep='|',index=False, encoding='ascii')

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1344, in to_csv
    formatter.save()

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1551, in save
    self._save()

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1652, in _save
    self._save_chunk(start_i, end_i)

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1678, in _save_chunk
    lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)

  File "pandas\lib.pyx", line 1075, in pandas.lib.write_csv_rows (pandas\lib.c:19767)

UnicodeEncodeError: 'ascii' codec can't encode character '\xb4' in position 7: ordinal not in range(128)

If I change to_csv to have utf-8 encoding, then I can't read the file in properly:

newdf.to_csv('output.txt', sep='|', index=False, encoding='utf-8')
pd.read_csv('output.txt', sep='|')

> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 2: invalid start byte

My goal is to have a pipe-delimited file that retains the accents and special characters.

Also, is there an easy way to figure out which line read_csv is breaking on? Right now I don't know how to get it to show me the bad character(s).
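One way to find the offending line(s) before handing the file to `read_csv` is to read the raw bytes yourself and try decoding each line. This is a sketch, not part of the original question; `find_bad_lines` and the filename are hypothetical:

```python
def find_bad_lines(path, encoding='utf-8'):
    """Return (line_number, error) pairs for lines that fail to decode."""
    bad = []
    with open(path, 'rb') as f:  # read raw bytes; no decoding yet
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                bad.append((lineno, err))  # record where and why it failed
    return bad
```

Each `UnicodeDecodeError` carries the byte offset and the bad byte, so this also answers "which character broke it."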

ale19
    Possible duplicate of [Pandas writing dataframe to CSV file](http://stackoverflow.com/questions/16923281/pandas-writing-dataframe-to-csv-file) – mbecker Dec 19 '16 at 18:30
  • Are you normalizing your unicode strings to remove accents? I thought ASCII can't handle those letters... – juanpa.arrivillaga Dec 19 '16 at 18:35
  • @juanpa.arrivillaga: I edited my post to clarify what I'm looking for in my output. – ale19 Dec 19 '16 at 18:40
  • @ale19 you cannot encode accents and special characters in ASCII. It is a bare-bones representation. That is *why* encodings like UTF-8 exist. Just write it out in UTF-8. – juanpa.arrivillaga Dec 19 '16 at 18:54

5 Answers


Check the answer here

It's a much simpler solution:

newdf.to_csv('filename.csv', encoding='utf-8')
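The companion fix, which the question runs into, is to pass the same encoding when reading the file back. A minimal round-trip sketch (the dataframe and filename here are stand-ins, not the asker's data):

```python
import pandas as pd

# Write and read back with matching encodings; accented characters survive.
newdf = pd.DataFrame({'name': ['café', 'niño', 'façade']})
newdf.to_csv('output.txt', sep='|', index=False, encoding='utf-8')
df2 = pd.read_csv('output.txt', sep='|', encoding='utf-8')
```

The original `UnicodeDecodeError` on read happened because the file on disk was not actually UTF-8; once both sides agree on the encoding, the round trip is lossless.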
Ohad Zadok

You have some characters that are not ASCII and therefore cannot be encoded as you are trying to do. I would just use utf-8 as suggested in a comment.

To check which lines are causing the issue you can try something like this:

def is_not_ascii(string):
    # isinstance guards against NaN/None values in the column
    return isinstance(string, str) and any(ord(ch) >= 128 for ch in string)

df[df[col].apply(is_not_ascii)]

You'll need to specify the column col you are testing.
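For example, with a hypothetical `name` column (the data below is made up to illustrate), the filter returns only the rows containing non-ASCII characters:

```python
import pandas as pd

def is_not_ascii(string):
    # isinstance guards against NaN/None values in the column
    return isinstance(string, str) and any(ord(ch) >= 128 for ch in string)

df = pd.DataFrame({'name': ['plain', 'café', None]})
bad_rows = df[df['name'].apply(is_not_ascii)]  # only the 'café' row
```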

Jukurrpa
Alex
  • Thanks. When I try your function (specifying the column), I get TypeError: ord() expected a character, but string of length 17 found. I'm assuming this is because ord() checks individual characters, but the column in question contains strings. – ale19 Dec 19 '16 at 19:23
  • If you do `df[df[col].apply(is_ascii) == False]` then you get only the rows/indices where an error was found. – dreab Dec 19 '17 at 10:44

Another solution is to use the string methods encode/decode with the `'ignore'` option, but note that it removes the non-ASCII characters rather than preserving them:

df['text'] = df['text'].apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))
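A quick check of what `'ignore'` actually does (the sample value is made up): any character without an ASCII equivalent is silently dropped, not transliterated.

```python
# 'ignore' drops accented letters entirely rather than converting them,
# so 'café ´' loses both the é and the acute accent.
cleaned = 'café ´'.encode('ascii', 'ignore').decode('ascii')
# cleaned == 'caf '
```

That makes this option a poor fit for the original goal of retaining accents; it is only useful when ASCII output is a hard requirement.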

Edward Weinert

Try this; it works:

newdf.to_csv('filename.csv', encoding='utf-8')

Sumit Shrestha

When I read a CSV file with Latin characters such as á, é, í, ó, ú, ñ, etc., my solution is to use `encoding='latin_1'`:

df = pd.read_csv('filename.txt', sep='|', encoding='latin_1')
<do stuff>
newdf.to_csv('output.txt', sep='|', index=False, encoding='latin_1')

You can read the complete list in the documentation: [List of Python standard encodings](https://docs.python.org/3/library/codecs.html#standard-encodings).
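Why this never raises on read: `latin_1` maps all 256 byte values one-to-one onto the first 256 Unicode code points, so every byte decodes to something. A quick check using the `'\xb4'` byte from the original traceback:

```python
# latin_1 is a 1:1 byte-to-codepoint mapping, so decoding cannot fail.
# 0xb4 is the acute accent that broke the ASCII encode in the question.
assert b'\xb4'.decode('latin_1') == '´'
assert 'ñ'.encode('latin_1') == b'\xf1'
```

The trade-off: if the file is actually UTF-8, `latin_1` will still "succeed" but produce mojibake (e.g. `Ã©` instead of `é`), so use it only when the file really is Latin-1 encoded.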