My company uses a database, and I am writing a script that interacts with it. There is already a script that submits a query to the database and returns the results.

I am working in a Unix environment. My script calls that existing script to get some data from the database and redirects the query result to a file. When I then try to read this file, I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 9741: ordinal not in range(128)

I believe Python is not able to read the file because of the file's encoding. The file is not ASCII-encoded, which is why the error occurs. I checked the encoding of the file and tried reading the file with that encoding.

The code that I am using is:

import codecs
import csv
import os
from collections import Counter

import chardet

# Run the query and redirect its output to patchlet.txt
os.system("Query.pl \"select title from bug where (ste='KGF-A' AND ( status = 'Not_Approved')) \">patchlet.txt")
# Detect the encoding of the query output
with open("patchlet.txt", "rb") as rawFile:
    encoding_dict3 = chardet.detect(rawFile.read())
print(encoding_dict3)
# Open the patchlet.txt file for storing the last part of titles for latest ACF in a list
patchlets_in_latest = []
with codecs.open("patchlet.txt", encoding=encoding_dict3['encoding']) as csvFile:
    readCSV = csv.reader(csvFile, delimiter=":")
    for row in readCSV:
        if len(row) != 0:
            # the last field holds the title (this also covers single-field rows)
            patchlets_in_latest.append(row[-1])
# calling the strip_list_noempty function (defined elsewhere in the script, not shown)
# for removing newline and whitespace characters
patchlets_in_latest_list = strip_list_noempty(patchlets_in_latest)
# converting list of titles to a set to remove any duplicate entry if present
patchlets_in_latest_set = set(patchlets_in_latest_list)
# Finding duplicate entries in list
duplicates_in_latest = [k for k, v in Counter(patchlets_in_latest_list).items() if v > 1]
# Printing important info for logs
print("list of titles of patchlets in latest list are : ")
for i in patchlets_in_latest_list:
    print(str(i))  # <-- the error is raised here
print("No of patchlets in latest list are : {}".format(len(patchlets_in_latest_list)))

Here Query.pl is the Perl script that fetches the query result from the database. The encoding detected for "patchlet.txt" (the file used for storing the result from HSD) is:

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

Even when I provide the same encoding for reading the file, I still get the error.

Please help me in resolving this error.

EDIT: I am using python3.6

EDIT2:

I get the error while outputting the result, and there is one line in the file that contains an unknown character. The line looks like:

Some failure because of which vtrace cannot be used along with some trace.

I am using gvim, and in gvim "vtrace" looks like "~Vvtrace". I then checked the database manually for this character; it is "–", which according to my keyboard is neither a hyphen nor an underscore. Characters like this are causing the problem.

Also, I am working in a Linux environment.

EDIT 3: I have added more code that can help in tracing the error. Also I have highlighted a "print" statement (print(str(i))) where I am getting the error.

rikki
  • `encoding='windows-1252'` (note the lower case 'w') or `encoding='cp1252'` ought to work - see [codec names and aliases](https://docs.python.org/3/library/codecs.html#standard-encodings) – snakecharmerb Feb 04 '19 at 08:37
  • No, neither works; I am still getting the same error. @snakecharmerb – rikki Feb 04 '19 at 08:44
  • Are you able to share an [mcve], and let us know which version of python that you are running? And a traceback? – snakecharmerb Feb 04 '19 at 08:48
  • check this: https://stackoverflow.com/a/19256955/2987755, https://stackoverflow.com/a/5387966/2987755, https://github.com/jdunck/python-unicodecsv – dkb Feb 04 '19 at 10:13
  • I'm guessing the problem is happening when you are outputting your results, not when reading the input. But it isn't possible to do more than guess without some code and data that reproduces the problem, or at least the code and a traceback. – snakecharmerb Feb 04 '19 at 19:05
  • I have added all the information I can. Please look at the question now. @snakecharmerb – rikki Feb 05 '19 at 11:05
  • Possible duplicate of [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) – tripleee Feb 05 '19 at 11:40
  • Again, the traceback comes from code you have not shown. Something after the `with codecs.open` is probably trying to simply `print` to a place where Python cannot determine a correct encoding. – tripleee Feb 05 '19 at 11:41
  • As an aside, `encoding='{}'.format(encoding_dict3['encoding'])` is a really roundabout way of saying `encoding=encoding_dict3['encoding']` – tripleee Feb 05 '19 at 11:42
  • I have added more code to help trace the script and indicated the line where the error is shown on the terminal. I hope that helps. @tripleee – rikki Feb 07 '19 at 09:28
  • @rikki when you post a question about an exception, always post the full traceback (the one you got running the exact code you posted so line numbers etc match) - the error message itself is useless if you don't at least know where the error happens. – bruno desthuilliers Feb 07 '19 at 09:33
  • Given where you get the error, the issue has nothing to do with reading the file, it's about your `sys.stdout` not being configured for this encoding. – bruno desthuilliers Feb 07 '19 at 09:37
  • Printing `sys.version_info` gives "sys.version_info(major=3, minor=6, micro=3, releaselevel='final', serial=0)". @snakecharmerb – rikki Feb 08 '19 at 08:57

1 Answer

Problem

Based on the information in the question, the program is processing non-ASCII input data, but is unable to output non-ASCII data.

Specifically, this code:

for i in patchlets_in_latest_list:
   print(str(i))

Results in this exception:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2013'

This behaviour was common in Python 2, where calling str on a unicode object would cause Python to try to encode the object as ASCII, resulting in a UnicodeEncodeError if the object contained non-ASCII characters.

In Python 3, calling str on a str instance doesn't trigger any encoding. However, calling the print function on a str will encode the str to sys.stdout.encoding. sys.stdout.encoding defaults to the value returned by locale.getpreferredencoding, which on Linux is generally determined by your user's locale settings (typically the LANG environment variable).
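
The effect can be reproduced without touching the real terminal. The following is a minimal sketch that prints the same string to two in-memory streams, one configured for ASCII and one for UTF-8, standing in for sys.stdout under different locales. The helper name try_print is hypothetical and used only for illustration.

```python
import io


def try_print(text, encoding):
    """Print text to an in-memory text stream that uses the given encoding."""
    raw = io.BytesIO()
    # newline="\n" keeps the written bytes identical on every platform
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError as exc:
        # This is the same exception type reported in the question
        return "failed: {}".format(exc.reason)
    return "ok: {!r}".format(raw.getvalue())


title = "Hello \u2013 World"
print(try_print(title, "ascii"))   # encoding the en dash fails
print(try_print(title, "utf-8"))   # the en dash is written as three bytes
```

The str itself is fine in both cases; only the stream's encoding decides whether print succeeds.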

Solution

If we assume that your program is not overriding normal encoding behaviour, the problem should be fixed by ensuring that the code is being executed by a Python3 interpreter in a UTF-8 locale.

  • be 100% certain that the code is being executed by a Python3 interpreter - print sys.version_info from within the program.
  • try setting the PYTHONIOENCODING environment variable when running your script: PYTHONIOENCODING=UTF-8 python3 myscript.py
  • check your locale using the locale command in the terminal (or echo $LANG). If it doesn't end in UTF-8, consider changing it. Consult your system administrators if you are on a corporate machine.
  • if your code runs in a cron job, bear in mind that cron jobs often run with the 'C' or 'POSIX' locale - which could be using ASCII encoding - unless a locale is explicitly set. Likewise if the script is run under a different user, check their locale settings.
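
The checks above can also be gathered from inside the script, which helps when it runs under cron or another user. This is a rough diagnostic sketch using only the standard library; the function name describe_output_encoding is an assumption made for this example, not something from the original script.

```python
import locale
import os
import sys


def describe_output_encoding():
    """Collect the settings that decide how print() encodes its output."""
    return {
        "python_version": tuple(sys.version_info[:3]),
        # Some replaced stdout objects have no encoding attribute
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(),
        # Environment variables that commonly influence the values above
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


for key, value in describe_output_encoding().items():
    print("{}: {}".format(key, value))
```

If stdout_encoding is reported as an ASCII variant (for example ANSI_X3.4-1968), the locale or PYTHONIOENCODING is the likely place to look.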

Workaround

If changing the environment is not feasible, you can work around the problem in Python by encoding to ASCII with an error handler, then decoding back to str.

There are four useful error handlers in your particular situation. Their effects are demonstrated with this code:

>>> s = 'Hello \u2013 World'
>>> s
'Hello – World'
>>> handlers = ['ignore', 'replace', 'xmlcharrefreplace', 'namereplace']
>>> print(str(s))
Hello – World
>>> for h in handlers:
...     print(f'Handler: {h}:', s.encode('ascii', errors=h).decode('ascii'))
... 
Handler: ignore: Hello  World
Handler: replace: Hello ? World
Handler: xmlcharrefreplace: Hello &#8211; World
Handler: namereplace: Hello \N{EN DASH} World

The ignore and replace handlers lose information - you can't tell which character was dropped or replaced with a question mark.

The xmlcharrefreplace and namereplace handlers do not lose information, but the replacement sequences may make the text less readable to humans.
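
To illustrate the "no information lost" point for xmlcharrefreplace, here is a small sketch in which a consumer of the output recovers the original text with html.unescape from the standard library. It assumes the text contained no HTML entities of its own; any pre-existing entity such as &amp; would be unescaped as well.

```python
import html

original = "Hello \u2013 World"

# Produce ASCII-only text; the en dash becomes a numeric character reference
escaped = original.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(escaped)

# A consumer of the output can turn the reference back into the character
restored = html.unescape(escaped)
print(restored == original)
```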

It's up to you to decide which tradeoff is acceptable for the consumers of your program's output.

If you decided to use the replace handler, you would change your code like this:

for i in patchlets_in_latest_list:
    replaced = i.encode('ascii', errors='replace').decode('ascii')
    print(replaced)

wherever you are printing data that might contain non-ASCII characters.
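
If the script prints in several places, the conversion could be wrapped in a small helper so the handler is chosen once. This is only a sketch; the names to_ascii and print_ascii are hypothetical and not part of the original script.

```python
def to_ascii(text, errors="replace"):
    """Return an ASCII-only version of text using the given error handler."""
    return str(text).encode("ascii", errors=errors).decode("ascii")


def print_ascii(items, errors="replace"):
    """Print each item after making it safe for an ASCII-only stdout."""
    for item in items:
        print(to_ascii(item, errors=errors))


# Sample data standing in for patchlets_in_latest_list
print_ascii(["Hello \u2013 World", "plain title"])
```

Passing errors="namereplace" or errors="xmlcharrefreplace" instead keeps the replaced characters recoverable.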

snakecharmerb
  • Hi, thank you very much for such a clear, explanatory answer. It has completely resolved the issue. @snakecharmerb – rikki Feb 09 '19 at 17:45
  • THANK YOU!!! More keywords for this post: Docker Python Ubuntu, "print() doesn't work right in iPython" – Tony Fraser Sep 11 '20 at 01:21