
I'm trying to read and write a dataframe to a pipe-delimited file. Some of the characters are non-Roman letters (´, ç, ñ, etc.), and writing fails when I try to encode them as ASCII.

df = pd.read_csv('filename.txt', sep='|', encoding='utf-8')
<do stuff>
newdf.to_csv('output.txt', sep='|', index=False, encoding='ascii')

The traceback:

  File "<ipython-input-63-ae528ab37b8f>", line 21, in <module>
    newdf.to_csv(filename,sep='|',index=False, encoding='ascii')

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1344, in to_csv
    formatter.save()

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1551, in save
    self._save()

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1652, in _save
    self._save_chunk(start_i, end_i)

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1678, in _save_chunk
    lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)

  File "pandas\lib.pyx", line 1075, in pandas.lib.write_csv_rows (pandas\lib.c:19767)

UnicodeEncodeError: 'ascii' codec can't encode character '\xb4' in position 7: ordinal not in range(128)

If I change to_csv to have utf-8 encoding, then I can't read the file in properly:

newdf.to_csv('output.txt', sep='|', index=False, encoding='utf-8')
pd.read_csv('output.txt', sep='|')

> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 2: invalid start byte

My goal is to have a pipe-delimited file that retains the accents and special characters.

Also, is there an easy way to figure out which line read_csv is breaking on? Right now I don't know how to get it to show me the bad character(s).
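One way to find the offending line(s) before handing the file to `read_csv` is to read the raw bytes yourself and try decoding each line. This is a sketch, not part of the original question; `find_bad_lines` and the filename are hypothetical:

```python
def find_bad_lines(path, encoding='utf-8'):
    """Return (line_number, error) pairs for lines that fail to decode."""
    bad = []
    with open(path, 'rb') as f:  # read raw bytes; no decoding yet
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                bad.append((lineno, err))  # record where and why it failed
    return bad
```

Each `UnicodeDecodeError` carries the byte offset and the bad byte, so this also answers "which character broke it."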

ale19
    Possible duplicate of [Pandas writing dataframe to CSV file](http://stackoverflow.com/questions/16923281/pandas-writing-dataframe-to-csv-file) – mbecker Dec 19 '16 at 18:30
  • Are you normalizing your unicode strings to remove accents? I thought ASCII can't handle those letters... – juanpa.arrivillaga Dec 19 '16 at 18:35
  • @juanpa.arrivillaga: I edited my post to clarify what I'm looking for in my output. – ale19 Dec 19 '16 at 18:40
  • @ale19 you cannot encode accents and special characters in ASCII. It is a bare-bones representation. That is *why* encodings like UTF-8 exist. Just write it out in UTF-8. – juanpa.arrivillaga Dec 19 '16 at 18:54

5 Answers


Check the answer here

It's a much simpler solution:

newdf.to_csv('filename.csv', encoding='utf-8')
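The companion fix, which the question runs into, is to pass the same encoding when reading the file back. A minimal round-trip sketch (the dataframe and filename here are stand-ins, not the asker's data):

```python
import pandas as pd

# Write and read back with matching encodings; accented characters survive.
newdf = pd.DataFrame({'name': ['café', 'niño', 'façade']})
newdf.to_csv('output.txt', sep='|', index=False, encoding='utf-8')
df2 = pd.read_csv('output.txt', sep='|', encoding='utf-8')
```

The original `UnicodeDecodeError` on read happened because the file on disk was not actually UTF-8; once both sides agree on the encoding, the round trip is lossless.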
Ohad Zadok

You have some characters that are not ASCII and therefore cannot be encoded as you are trying to do. I would just use utf-8 as suggested in a comment.

To check which lines are causing the issue you can try something like this:

def is_not_ascii(string):
    # isinstance guards against NaN/None values in the column
    return isinstance(string, str) and any(ord(ch) >= 128 for ch in string)

df[df[col].apply(is_not_ascii)]

You'll need to specify the column col you are testing.
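For example, with a hypothetical `name` column (the data below is made up to illustrate), the filter returns only the rows containing non-ASCII characters:

```python
import pandas as pd

def is_not_ascii(string):
    # isinstance guards against NaN/None values in the column
    return isinstance(string, str) and any(ord(ch) >= 128 for ch in string)

df = pd.DataFrame({'name': ['plain', 'café', None]})
bad_rows = df[df['name'].apply(is_not_ascii)]  # only the 'café' row
```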

Jukurrpa
Alex
  • Thanks. When I try your function (specifying the column), I get TypeError: ord() expected a character, but string of length 17 found. I'm assuming this is because ord() checks individual characters, but the column in question contains strings. – ale19 Dec 19 '16 at 19:23
  • If you do `df[df[col].apply(is_ascii) == False]` then you get only the rows/indices where an error was found. – dreab Dec 19 '17 at 10:44

Another solution is to use the string methods encode/decode with the `'ignore'` option, but note that it removes the non-ASCII characters rather than preserving them:

df['text'] = df['text'].apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))
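A quick check of what `'ignore'` actually does (the sample value is made up): any character without an ASCII equivalent is silently dropped, not transliterated.

```python
# 'ignore' drops accented letters entirely rather than converting them,
# so 'café ´' loses both the é and the acute accent.
cleaned = 'café ´'.encode('ascii', 'ignore').decode('ascii')
# cleaned == 'caf '
```

That makes this option a poor fit for the original goal of retaining accents; it is only useful when ASCII output is a hard requirement.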

Edward Weinert

Try this; it works:

newdf.to_csv('filename.csv', encoding='utf-8')

Sumit Shrestha

When I read a CSV file with Latin characters such as á, é, í, ó, ú, ñ, etc., my solution is to use `encoding='latin_1'`:

df = pd.read_csv('filename.txt', sep='|', encoding='latin_1')
<do stuff>
newdf.to_csv('output.txt', sep='|', index=False, encoding='latin_1')

You can read the complete list in the documentation: [List of Python standard encodings](https://docs.python.org/3/library/codecs.html#standard-encodings).
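Why this never raises on read: `latin_1` maps all 256 byte values one-to-one onto the first 256 Unicode code points, so every byte decodes to something. A quick check using the `'\xb4'` byte from the original traceback:

```python
# latin_1 is a 1:1 byte-to-codepoint mapping, so decoding cannot fail.
# 0xb4 is the acute accent that broke the ASCII encode in the question.
assert b'\xb4'.decode('latin_1') == '´'
assert 'ñ'.encode('latin_1') == b'\xf1'
```

The trade-off: if the file is actually UTF-8, `latin_1` will still "succeed" but produce mojibake (e.g. `Ã©` instead of `é`), so use it only when the file really is Latin-1 encoded.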