27

I am getting the following error while executing the code snippet below, exactly at the line if uID in repo.git.log():. The problem is in repo.git.log(). I have looked at all the similar questions on Stack Overflow, which suggest using decode("utf-8").

How do I apply decode("utf-8") to the output of repo.git.log()?

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 377826: invalid start byte 

Relevant code:

..................
uID = gerritInfo['id'].decode("utf-8")                                            
if uID in repo.git.log():
        inwslist.append(gerritpatch)      
.....................


Traceback (most recent call last):
  File "/prj/host_script/script.py", line 1417, in <module>
    result=main()
  File "/prj/host_script/script.py", line 1028, in main
    if uID in repo.git.log():
  File "/usr/local/lib/python2.7/dist-packages/git/cmd.py", line 431, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/git/cmd.py", line 802, in _call_process
    return self.execute(make_call(), **_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/git/cmd.py", line 610, in execute
    stdout_value = stdout_value.decode(defenc)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 377826: invalid start byte
dreftymac
  • 1
    What Python module provides the `repo.git.log()` call? What version of Python is this? What is the **full** traceback of the exception? – Martijn Pieters Apr 02 '15 at 17:55
  • @MartijnPieters - it is happening in Python 2.7.3 and Python 2.7.5; I have updated the question with the full traceback –  Apr 02 '15 at 18:00
  • The error happens way down in the bowels of the `repo.git.log()` command, I think because the output produced by the `git` command doesn't produce UTF-8. That could be because the git log contains non-UTF-8 data, or for a different reason. I don't know what project provides the `git` package however. – Martijn Pieters Apr 02 '15 at 18:03
  • @MartijnPieters - can we do something like repo.git.log = repo.git.log(encode('utf-8')), or strip off the right single quotation mark? What would be the syntax for that? –  Apr 02 '15 at 18:13
  • You can't, no. I don't know what git command is run or what influences the codecs used here. – Martijn Pieters Apr 02 '15 at 18:28
  • I don't know what `git.py` you have but in `stdout_value = stdout_value.decode(defenc)` that `defenc` is interesting. The name suggests "default encoding", so there appears to be a knob you can turn to set that `git.py` to expect different encodings for different commit messages. – torek Apr 02 '15 at 18:34
  • What value would go in place of defenc to convert to the utf8 codec? –  Apr 02 '15 at 19:34
  • @torek: it just picks the system default: https://github.com/gitpython-developers/GitPython/blob/master/git/compat.py#L26 – Martijn Pieters Apr 02 '15 at 21:20
  • According to the [git documentation](http://git-scm.com/docs/git-log) git does not enforce any encodings, but UTF-8 is 'preferred'. The same documentation also shows that it is possible to configure git to output a specific codec; `i18n.logoutputencoding` specifically. – Martijn Pieters Apr 02 '15 at 21:24
  • @MartijnPieters - any idea how to change repo.git.log() to output UTF-8? –  Apr 03 '15 at 03:01
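One way to act on the `i18n.logoutputencoding` hint is to bypass the wrapper and call `git log` directly, asking git for UTF-8 output and decoding leniently. The sketch below is written for Python 3 and uses only the standard library; the helper names `build_log_command`, `decode_lenient`, and `git_log_text` are hypothetical and not part of GitPython. Note that git can only re-encode commits whose original encoding it knows, so mislabeled commit messages may still come through as raw bytes, which is why the decode step replaces bad bytes instead of raising.

```python
import subprocess


def build_log_command(encoding="UTF-8"):
    # --encoding asks git to re-encode commit messages on output.
    return ["git", "log", "--encoding=" + encoding]


def decode_lenient(raw):
    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")


def git_log_text(repo_path):
    # Run git in the given working tree and return the log as text.
    raw = subprocess.check_output(build_log_command(), cwd=repo_path)
    return decode_lenient(raw)
```

With this, a membership test such as `uID in git_log_text(path)` would no longer fail on a stray 0x92 byte, at the cost of that byte showing up as a replacement character in the text.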

4 Answers

42

Using encoding='cp1252' will solve the issue.
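The answer does not say which call takes the encoding argument. As a minimal sketch, assuming the log output is available as raw bytes, the decoding step could try UTF-8 first and fall back to cp1252 only when that fails. The function name `decode_with_fallback` is hypothetical.

```python
def decode_with_fallback(raw, fallback="cp1252"):
    """Decode bytes as UTF-8, falling back to another codec on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so replace those
        # rather than raising a second error.
        return raw.decode(fallback, errors="replace")
```

Trying UTF-8 first matters: decoding valid UTF-8 directly as cp1252 would not raise, but it would turn every multi-byte character into mojibake.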

Abdul Rehman
23

0x92 is a smart quote (’) in Windows-1252. It simply doesn't exist in Unicode, therefore it can't be decoded.

Maybe your file was edited on a Windows machine, which caused this problem?

Smart Manoj
Exceen
  • That's my question too: how do I make sure repo.git.log() has the right character encoding to get rid of this error? –  Apr 02 '15 at 18:03
  • 13
    0x92 in Windows-1252 is simply one encoding for the [U+2019 RIGHT SINGLE QUOTATION MARK](http://codepoints.net/U+2019) codepoint in Unicode. To state that it doesn't exist in Unicode is.. incorrect; in UTF8 it should be encoded as 0xE2 0x80 0x99. If this *is* about data encoded as CP-1252 then that just means there is an encoding error somewhere. – Martijn Pieters Apr 02 '15 at 18:06
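The byte values in that comment can be checked directly in an interpreter. This short Python 3 sketch decodes 0x92 as cp1252, re-encodes the result as UTF-8, and confirms that the lone byte is rejected by the UTF-8 codec.

```python
# 0x92 in Windows-1252 maps to U+2019 RIGHT SINGLE QUOTATION MARK.
smart_quote = b"\x92".decode("cp1252")

# The same code point encoded as UTF-8 is the three bytes E2 80 99.
utf8_bytes = smart_quote.encode("utf-8")

# A lone 0x92 is not valid UTF-8, which is what the traceback reports.
try:
    b"\x92".decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False
```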
1

After some research, I found the solution. In my case, the datadump.json file was causing the issue.

  • Open the file in Notepad.
  • Click the "Save As" option.
  • Go to the "Encoding" section at the bottom and select "UTF-8".
  • Save the file.

Now you can try running the command. You are good to go :)
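The same conversion can be scripted instead of done by hand. This is a minimal Python 3 sketch, assuming the file is currently encoded as cp1252 (the usual "ANSI" code page on Western-locale Windows, which may not match your system); `reencode_file` is a hypothetical helper name.

```python
def reencode_file(path, source_encoding="cp1252"):
    """Rewrite a text file in place as UTF-8."""
    # newline="" keeps the original line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```

Because the file is overwritten in place, it is worth keeping a backup copy until the result has been checked.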

For your reference, screenshots of the three steps were attached (Step 1, Step 2, Step 3).

0

0x92 is not a valid start byte in UTF-8. As Exceen stated in his answer, 0x92 is used in Windows-1252 as a smart quote. The way to resolve this is to use the Windows-1252 encoding or to replace the smart quote with a normal quote.
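The second option, replacing smart quotes with plain ones, could look like the following Python 3 sketch. It assumes the text has already been decoded correctly (for example as Windows-1252); the table and the function name `replace_smart_quotes` are illustrative, not taken from any library.

```python
# Map curly quote code points to their plain ASCII equivalents.
SMART_QUOTES = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark (0x92 in Windows-1252)
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
}


def replace_smart_quotes(text):
    # str.translate accepts a dict keyed by code point.
    return text.translate(SMART_QUOTES)
```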

Jake