
I am having some trouble encoding ascii characters to UTF-8, or a string is not picking up the encoding.

import unicodecsv as csv
import re
import pyodbc
import sys
import unicodedata
    
#!/usr/bin/python
# -*- coding: UTF-8 -*-
    
def remove_non_ascii_1(text):

    text.encode('utf-8')
    
    for i in text:

        return ''.join(i for i in text if i=='£')

In Python 2.7 I get the error

SyntaxError: Non-ASCII character '\xc2' in file on line 16, but no encoding declared; see SyntaxError: Non-ASCII character '\xc2' in file. 

With the Unicode replacement

return ''.join(i for i in text if i=='\xc2')

the error is

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Sample text :

row from a csv file reading in

[u'06/11/2020', u'ABC', u'32154E', u'3214', u'DEF', u'Cash Purchase', u'Final', u'', u'20.00%', u'ABC', u'Sold From Pickup', u'New ', u'10.00%', u'0', u'15%', u'\xa3469.84', u'Jonathan Jerome', u'3', u'\xa3140.95', u'2%', u'\xa393.97', u'\xa39,396.83', u'', u'\xa35,638.00', u'30/06/2020', u'4', u'Boiler-Amended']

I want to remove the \xa3 or £ in the currency fields.

Cedric
JIH

2 Answers


Two things up front:

  1. Don't use Python 2 any more, for the reasons mentioned here!
  2. Don't mix different encodings in Python 2.
    TL;DR: Python 3 improved so many things regarding encodings that staying on Python 2 simply isn't worth it.
    Whole story: read here

OK, with this out of the way, let's start fixing your code.

As Klaus D. already mentioned, you do not save the result of text.encode('utf-8'). This leads to an encoding warning when comparing seemingly equal characters (£ and £): one is a unicode character coming from the file you read, while the other is a plain byte string taken from your source code, which Python 2 tries to decode as ASCII for the comparison and fails. This happens despite the -*- coding: UTF-8 -*- line: that line only declares what your code file is encoded in, it does not change how the interpreter treats plain str literals.
Edit: Also, when comparing to the character you need to compare against a unicode character, so you can either convert it or simply tell the interpreter to treat the literal as unicode in the first place (that's why I added a leading 'u' in front of your '£').

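To fix this, either save the result back into text after calling text.encode('utf-8') (text = text.encode('utf-8')) and then compare bytes with bytes, or leave text as unicode and compare it with a unicode literal. Mixing the two is what triggers the warning.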
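To see why the bare call has no effect, here is a minimal sketch. The variable names are only illustrative, and it assumes text is a unicode string like the fields in your sample row:

```python
# -*- coding: utf-8 -*-
text = u'\xa3469.84'            # unicode string, like a field from the sample row

text.encode('utf-8')            # result is thrown away; text itself is unchanged
still_unicode = text

encoded = text.encode('utf-8')  # keep the result to actually get UTF-8 bytes

# The pound sign is one unicode character but two UTF-8 bytes.
pound_bytes = u'\xa3'.encode('utf-8')
```

Those two bytes, \xc2 followed by \xa3, are also why the SyntaxError in the question complains about '\xc2'.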

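Additionally, the "shebang" and the encoding declaration should always be at the very top of the file, so that as soon as you open the file you know what you are dealing with. Python also only honours the encoding declaration when it appears on the first or second line, which is why you got the SyntaxError about no encoding being declared.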

Something else I would correct is the first for-loop. It is unnecessary because you return from the function anyway while handling the first element.

This means the completely "corrected" code is this.

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import unicodecsv as csv
import re
import pyodbc
import sys
import unicodedata

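def remove_non_ascii_1(text):
    # text is already unicode, so the no-op text.encode('utf-8') call is dropped.
    # Keep every character except the pound sign (the original == kept only '£').
    return u''.join(i for i in text if i != u'£')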

PS: You should really think again about whether def remove_non_ascii_1(text) is necessary at all. By the looks of it, your input is already a list of unicode strings, which you probably read directly from the file. This means you don't need to correct the encoding, though the comparison against '£' could stay. You might just want to rename the method ;)
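As a rough sketch of what such a renamed helper could look like (strip_pound_sign and POUND are names I made up, and the row is a shortened version of your sample):

```python
# -*- coding: utf-8 -*-
POUND = u'\xa3'  # the pound sign as a unicode character


def strip_pound_sign(text):
    """Return text with every pound sign removed (hypothetical helper)."""
    return text.replace(POUND, u'')


# Shortened version of the sample row from the question.
row = [u'Cash Purchase', u'\xa3469.84', u'\xa39,396.83', u'']
cleaned = [strip_pound_sign(field) for field in row]
```

Using unicode.replace instead of a character-by-character join is just a design choice here; both work as long as both sides are unicode.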

Hope this helped and cleared up some confusion about Python 2 encodings :D

If you print your list as a whole now, you will see it still contains \xa3 and not the actual '£', but if you print the elements separately it works fine. This is because printing a list shows the repr() of each element, and in Python 2 the repr() of a unicode string escapes non-ASCII characters.
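A small sketch of that difference, using two fields from your sample row (the variable names are mine):

```python
# -*- coding: utf-8 -*-
row = [u'\xa3469.84', u'Jonathan Jerome']

# str() of a list is assembled from the repr() of each element ...
list_text = str(row)
rebuilt = '[' + ', '.join(repr(item) for item in row) + ']'

# ... so join the elements yourself if you want the readable text.
readable = u', '.join(row)
```

Here list_text and rebuilt are the same string, while readable contains the real pound sign.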

Cedric

Python 3 greatly improved unicode text handling. If you have to use Python 2.7, I would recommend using the codecs library when reading text files, since it helps you with Python 2.7 unicode issues:

import codecs
fp = codecs.open("file", "r", encoding="utf-8")
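For a self-contained round trip, something like the following should work. The temporary file and its contents are made up purely for the example:

```python
import codecs
import os
import tempfile

# Hypothetical temporary file, used only to make the example self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')

# Writing through codecs.open encodes the unicode text as UTF-8.
with codecs.open(path, 'w', encoding='utf-8') as fp:
    fp.write(u'price\n\xa3469.84\n')

# Reading through codecs.open decodes the bytes back into unicode.
with codecs.open(path, 'r', encoding='utf-8') as fp:
    content = fp.read()
```

After this, content is a unicode string again, so comparing against u'\xa3' works without warnings.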

In your case, I noticed that you are using unicodecsv as a drop-in csv replacement. Here you can pass the parameter encoding="utf-8" when reading the csv file into a list:

r = csv.reader(f, encoding='utf-8')

For just removing non-Ascii characters I would recommend checking this good answer on StackOverflow
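One common way to do that, which may or may not be the one used in the linked answer, is to encode to ASCII while ignoring everything that does not fit. strip_non_ascii is a name I picked for the sketch:

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range from a unicode string."""
    # 'ignore' silently skips characters that cannot be encoded as ASCII.
    return text.encode('ascii', 'ignore').decode('ascii')


cleaned = strip_non_ascii(u'\xa3469.84')
```

Note that this removes all non-ASCII characters, not only the pound sign, so accented names would lose their accented letters as well.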

ingofreyer