UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 377826: invalid start byte

Question

I am getting the following error while executing the code snippet below, exactly at the line `if uID in repo.git.log():`. The problem is in `repo.git.log()`. I have looked at all the similar questions on Stack Overflow, which suggest using `decode("utf-8")`.

How do I apply `decode("utf-8")` to the output of `repo.git.log()`?

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 377826: invalid start byte

Relevant code:

..................
uID = gerritInfo['id'].decode("utf-8")
if uID in repo.git.log():
    inwslist.append(gerritpatch)
.....................


Traceback (most recent call last):
  File "/prj/host_script/script.py", line 1417, in <module>
    result=main()
  File "/prj/host_script/script.py", line 1028, in main
    if uID in repo.git.log():
  File "/usr/local/lib/python2.7/dist-packages/git/cmd.py", line 431, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/git/cmd.py", line 802, in _call_process
    return self.execute(make_call(), **_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/git/cmd.py", line 610, in execute
    stdout_value = stdout_value.decode(defenc)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 377826: invalid start byte

What Python module provides the `repo.git.log()` call? What version of Python is this? What is the **full** traceback of the exception? — Martijn Pieters, Apr 02 '15 at 17:55
@MartijnPieters - it is happening in Python 2.7.3 and Python 2.7.5; I updated the question with the full traceback — , Apr 02 '15 at 18:00
The error happens way down in the bowels of the `repo.git.log()` command, I think because the output produced by the `git` command doesn't produce UTF-8. That could be because the git log contains non-UTF-8 data, or for a different reason. I don't know what project provides the `git` package however. — Martijn Pieters, Apr 02 '15 at 18:03
@MartijnPieters - can we do something like repo.git.log = repo.git.log(encode('utf-8')), or strip off the right single quotation mark? What would be the syntax for that? — , Apr 02 '15 at 18:13
You can't, no. I don't know what git command is run or what influences the codecs used here. — Martijn Pieters, Apr 02 '15 at 18:28
I don't know what `git.py` you have but in `stdout_value = stdout_value.decode(defenc)` that `defenc` is interesting. The name suggests "default encoding", so there appears to be a knob you can turn to set that `git.py` to expect different encodings for different commit messages. — torek, Apr 02 '15 at 18:34
What would be the value in place of defenc to convert to the utf8 codec? — , Apr 02 '15 at 19:34
@torek: it just picks the system default: https://github.com/gitpython-developers/GitPython/blob/master/git/compat.py#L26 — Martijn Pieters, Apr 02 '15 at 21:20
According to the [git documentation](http://git-scm.com/docs/git-log), git does not enforce any encoding, but UTF-8 is 'preferred'. The same documentation also shows that it is possible to configure git to output a specific codec, via `i18n.logOutputEncoding` specifically. — Martijn Pieters, Apr 02 '15 at 21:24
@MartijnPieters - any idea how to change repo.git.log() to output UTF-8? — , Apr 03 '15 at 03:01
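
One possible way to act on the `i18n.logOutputEncoding` hint, sketched under assumptions rather than taken from the thread: the helper names `decode_git_output` and `read_git_log` are hypothetical, and the sketch bypasses GitPython by calling the `git` CLI through `subprocess` so the raw bytes can be decoded explicitly. The cp1252 fallback is a guess based on the 0x92 byte in the error; the real log may contain some other legacy encoding. It targets Python 3.

```python
import subprocess


def decode_git_output(raw, fallback="cp1252"):
    """Decode bytes as UTF-8, falling back to a legacy codec.

    `fallback` is an assumption; errors="replace" guards against bytes
    (such as 0x81) that cp1252 leaves undefined.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")


def read_git_log(repo_path):
    """Return the log of the repository at `repo_path` as text."""
    raw = subprocess.check_output(
        # Ask git to re-encode commit messages to UTF-8 where it can.
        ["git", "-c", "i18n.logOutputEncoding=UTF-8", "log"],
        cwd=repo_path,
    )
    return decode_git_output(raw)
```

With helpers like these, the membership test from the question could be written as `if uID in read_git_log(repo_path):`, where `repo_path` stands in for the location of the working tree. Git can only re-encode commits whose recorded encoding is correct, which is why the fallback decode is kept.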

Abdul Rehman · Answer 1 · 2018-10-27T00:56:45.370

42

Using encoding='cp1252' when reading the data will solve the issue.
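
A minimal sketch of what that can look like when the data comes from a file, assuming Python 3 and a file that really is Windows-1252 encoded. The file name `notes.txt` is made up, and the file is created in a temporary directory only to keep the example self-contained.

```python
import os
import tempfile

# Build a sample file containing the Windows-1252 byte 0x92 (a curly apostrophe).
path = os.path.join(tempfile.mkdtemp(), "notes.txt")  # hypothetical file name
with open(path, "wb") as f:
    f.write(b"don\x92t panic\n")

# Reading this file as UTF-8 would raise UnicodeDecodeError;
# naming the actual encoding decodes the byte correctly.
with open(path, encoding="cp1252") as f:
    text = f.read()

print(text)
```

This only helps if the bytes really are cp1252; picking the wrong codec silently produces wrong characters instead of an error.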

edited Oct 27 '18 at 00:56

answered May 05 '18 at 04:01

Abdul Rehman


This solved it for me! As simple as reading in the file with this encoding. – Will J Jul 16 '19 at 04:16

score 23 · Answer 2 · edited Mar 12 '18 at 18:37


0x92 is a smart quote (’) in Windows-1252. It simply doesn't exist in Unicode, therefore it can't be decoded.

Maybe your file was edited on a Windows machine, which caused this problem?

edited Mar 12 '18 at 18:37

Smart Manoj


answered Apr 02 '15 at 18:01

Exceen


That's my question too: how do I make sure repo.git.log() has the right character encoding to get rid of this error? – Apr 02 '15 at 18:03

0x92 in Windows-1252 is simply one encoding for the [U+2019 RIGHT SINGLE QUOTATION MARK](http://codepoints.net/U+2019) codepoint in Unicode. To state that it doesn't exist in Unicode is.. incorrect; in UTF8 it should be encoded as 0xE2 0x80 0x99. If this *is* about data encoded as CP-1252 then that just means there is an encoding error somewhere. – Martijn Pieters Apr 02 '15 at 18:06
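
The byte-level relationship described in this comment can be checked directly in Python 3. This is only an illustration of the encodings involved, not a fix for the GitPython call.

```python
raw = b"\x92"  # the byte from the error message

# Decoding as Windows-1252 yields U+2019 RIGHT SINGLE QUOTATION MARK.
char = raw.decode("cp1252")
print(hex(ord(char)))

# The same character encoded as UTF-8 takes three bytes.
utf8_bytes = char.encode("utf-8")
print(utf8_bytes)

# A lone 0x92 is a UTF-8 continuation byte, so it cannot start a character.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason
    print(reason)
```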

score 1 · Answer 3 · answered Jan 29 '22 at 11:03

After some research, I found the solution. In my case, the datadump.json file was the problem.

Open the file in Notepad.
Click the Save As option.
In the Encoding section below, select "UTF-8".
Save the file.

Now you can try running the command. You are good to go :)
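
The same re-save can be scripted instead of done by hand. A rough sketch for Python 3, assuming the original file is Windows-1252 encoded (the answer does not say which encoding it was); `convert_to_utf8` and the file names are hypothetical.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Rewrite the text file at `src_path` as UTF-8 at `dst_path`."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo on a throwaway file; the names below are placeholders.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "datadump_cp1252.json")
dst = os.path.join(workdir, "datadump_utf8.json")
with open(src, "wb") as f:
    f.write(b'{"note": "it\x92s fine"}')

convert_to_utf8(src, dst)
with open(dst, "rb") as f:
    converted = f.read()
```

Writing to a separate destination file keeps the original intact in case the assumed source encoding turns out to be wrong.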


score 0 · Answer 4 · answered Aug 25 '22 at 15:07


The byte 0x92 is not a valid start byte in UTF-8. As Exceen stated in his answer, 0x92 is used in Windows-1252 as a smart quote. To resolve this, either decode the data with the Windows-1252 encoding or replace the smart quote with a plain quote.
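
For the second option, a small sketch of swapping smart quotes for plain ones after decoding, in Python 3. The mapping and the name `straighten_quotes` are illustrative only, and the sample bytes are invented.

```python
# Map common Windows-1252 "smart" punctuation to ASCII equivalents.
SMART_QUOTES = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (byte 0x92 in cp1252)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}
_TABLE = str.maketrans(SMART_QUOTES)


def straighten_quotes(text):
    """Return `text` with curly quotes replaced by straight ones."""
    return text.translate(_TABLE)


raw = b"the \x93team\x92s\x94 patch"  # made-up cp1252 sample
cleaned = straighten_quotes(raw.decode("cp1252"))
print(cleaned)
```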

answered Aug 25 '22 at 15:07

Jake

