Removing non-ascii characters in a csv file

Question

I am currently inserting data into my Django models from a CSV file. Below is the simple save function I am using:

import csv

def save(self):
    myfile = open('file.csv', 'rb')  # placeholder path for the uploaded file
    data = csv.reader(myfile, delimiter=',', quotechar='"')
    next(data, None)  # skip the header row
    for row in data:
        b = MyModel()
        b.create_from_csv_row(row)  # calls a method to save in models

The function works perfectly with ASCII characters. However, if the CSV file contains non-ASCII characters, an error is raised: UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 1526: ordinal not in range(128)

My question is: how can I remove non-ASCII characters before saving my CSV file, to avoid this error?

Thanks in advance.

DivinusVox · Accepted Answer · 2013-08-30T15:29:17.083

If you really want to strip it, try:

import unicodedata

# title is the unicode string to convert
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')

WARNING: THIS WILL MODIFY YOUR DATA. It attempts to find a close ASCII match, e.g. ć -> c.
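To make the effect of that warning concrete, here is a minimal sketch of the normalize-then-encode approach. It assumes Python 3 (where the result of encode() has to be decoded back to str), and the helper name to_ascii_approx is made up for illustration.

```python
import unicodedata


def to_ascii_approx(text):
    """Return an ASCII-only approximation of text (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Ignoring encode errors drops the combining marks and any character
    # that has no ASCII decomposition at all.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_approx("Jos\u00e9 \u0107"))    # -> Jose c
print(to_ascii_approx("\u201cquoted\u201d"))  # -> quoted
```

Note that accented letters are folded to their base letter, while characters such as curly quotes have no ASCII decomposition and simply disappear.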

Perhaps a better answer is to use unicodecsv instead.

----- EDIT ----- Okay, if you don't care about preserving those characters at all, try the following:

# If row references a unicode string
b.create_from_csv_row(row.encode('ascii', 'ignore'))

If row is a collection rather than a unicode string, you will need to iterate over it and encode each string element individually.
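Since csv.reader yields each row as a list of cells, the per-element iteration could look roughly like this. This is a sketch assuming Python 3 strings; strip_non_ascii and clean_row are illustrative names, not part of any library.

```python
def strip_non_ascii(value):
    """Drop every non-ASCII character from one cell (no approximation)."""
    return value.encode("ascii", "ignore").decode("ascii")


def clean_row(row):
    """Apply strip_non_ascii to each cell of a csv.reader row (a list of str)."""
    return [strip_non_ascii(cell) for cell in row]


row = ["line one", "caf\u00e9", "\u201cline 3\u201d"]
print(clean_row(row))  # -> ['line one', 'caf', 'line 3']
```

In the loop from the question, this would presumably be used as b.create_from_csv_row(clean_row(row)).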

edited Aug 30 '13 at 15:29

answered Aug 29 '13 at 22:57

DivinusVox


@DivinusVox Thanks for your answer, but I want to completely remove the non-ASCII characters – Njogu Mbau Aug 30 '13 at 08:04
Thanks..got an idea on how to go about it – Njogu Mbau Aug 30 '13 at 18:31

score 3 · Answer 2 · answered Aug 29 '13 at 23:17

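If you want to remove non-ASCII characters from your data, iterate through it character by character and keep only the ASCII ones.

kept = []
for item in data:  # data must be a string, so each item is one character
    if ord(item) < 128:  # ASCII code points are 0-127
        kept.append(item)  # or write, print, whatever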

If you instead want to convert Unicode characters to their closest ASCII equivalents, the answer above by DivinusVox is accurate.
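The same filtering can also be pushed down to the file-reading step, so the csv module never sees a non-ASCII byte. The sketch below assumes Python 3, where open() accepts encoding and errors arguments; read_ascii_rows is a hypothetical helper name.

```python
import csv


def read_ascii_rows(path):
    """Yield CSV rows from path with all non-ASCII bytes discarded."""
    # errors="ignore" makes the decoder skip any byte outside 0-127.
    with open(path, encoding="ascii", errors="ignore", newline="") as handle:
        reader = csv.reader(handle, delimiter=",", quotechar='"')
        next(reader, None)  # skip the header row
        for row in reader:
            yield row
```

Each yielded row is a plain list of ASCII-only strings, so it can be iterated in place of the data reader in the question's loop.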


jabgibson


Thanks for your answer, but I want to completely remove the non-ASCII characters in the CSV file. When I tried your approach, ord() raised an error: ord() expected string of length 1, but list found. Maybe it's because each row is a list containing more than one character. But my main issue is how to remove the non-ASCII characters in the CSV file. – Njogu Mbau Aug 30 '13 at 08:01
@Benarito Is your data just a one-dimensional list of strings? – DivinusVox Aug 30 '13 at 15:01
@DivinusVox Yes, a one-dimensional list of strings, e.g. line one, line 2, line 3 – Njogu Mbau Aug 30 '13 at 16:26
@Benarito I suggest @DivinusVox's edited solution if you want your data to stay as accurate as possible. ASCII characters have an integer ord() value between 0 and 127. If you want to drop non-ASCII characters, use an if statement to check whether each character satisfies `ord(x) < 128`. – jabgibson Aug 30 '13 at 17:05

A single-line, Pythonic way to do the same thing: ''.join([char for char in data if ord(char) < 128]) – Steve Saporta May 09 '17 at 20:05

score 3 · Answer 3 · answered Aug 30 '13 at 00:19

The pandas CSV parser (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html) supports different encodings:

import pandas
data = pandas.read_csv(myfile, encoding='utf-8', quotechar='"', delimiter=',')
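Which encoding to pass depends on the file, and the question does not say what it is. As a hedged illustration using only the standard library (not pandas), the byte 0x93 from the error message happens to be a curly opening quote in Windows-1252, while the same byte on its own is not valid UTF-8. Treating the file as Windows-1252 is only a guess based on that one byte.

```python
raw = b"\x93quoted\x94"  # bytes like the 0x93 from the error message

# Windows-1252 maps 0x93 and 0x94 to curly double quotes.
as_cp1252 = raw.decode("cp1252")
print(as_cp1252 == "\u201cquoted\u201d")  # -> True

# The same bytes are not valid UTF-8, so that encoding would still fail.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # -> False
```

If the guess holds, encoding='cp1252' would be the value to try in the read_csv call in place of 'utf-8'.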


zero323


Thanks for your answer, but my main issue was how to remove the non-ASCII characters before saving the file contents. – Njogu Mbau Aug 30 '13 at 08:02
