I'm running git commands from within a Google Colab notebook (I mounted a Google Drive containing a git repo into the Colab). All commands worked without a problem, and then suddenly some commands stopped working.

These commands still work in my Colab:

!git branch
!git stash pop
!git log -1

But these commands produce an error in my Colab that wasn't occurring before:

!git status
!git checkout master
!git pull origin master

Output of

!git stash pop 
!git status
No stash entries found.
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-27-7155109ce28e> in <module>()
     12 get_ipython().system('git stash pop')
     13 # !git branch
---> 14 get_ipython().system('git status')
     15 # !git pull origin master
     16 # if os.path.isdir(WORKING_DIR):

5 frames
/usr/local/lib/python3.6/dist-packages/google/colab/_shell.py in system(self, *args, **kwargs)
    100       kwargs.update({'also_return_output': True})
    101 
--> 102     output = _system_commands._system_compat(self, *args, **kwargs)  # pylint:disable=protected-access
    103 
    104     if pip_warn:

/usr/local/lib/python3.6/dist-packages/google/colab/_system_commands.py in _system_compat(shell, cmd, also_return_output)
    438   # stack.
    439   result = _run_command(
--> 440       shell.var_expand(cmd, depth=2), clear_streamed_output=False)
    441   shell.user_ns['_exit_code'] = result.returncode
    442   if -result.returncode in _INTERRUPTED_SIGNALS:

/usr/local/lib/python3.6/dist-packages/google/colab/_system_commands.py in _run_command(cmd, clear_streamed_output)
    193       os.close(child_pty)
    194 
--> 195       return _monitor_process(parent_pty, epoll, p, cmd, update_stdin_widget)
    196   finally:
    197     epoll.close()

/usr/local/lib/python3.6/dist-packages/google/colab/_system_commands.py in _monitor_process(parent_pty, epoll, p, cmd, update_stdin_widget)
    220   while True:
    221     try:
--> 222       result = _poll_process(parent_pty, epoll, p, cmd, decoder, state)
    223       if result is not None:
    224         return result

/usr/local/lib/python3.6/dist-packages/google/colab/_system_commands.py in _poll_process(parent_pty, epoll, p, cmd, decoder, state)
    273       output_available = True
    274       raw_contents = os.read(parent_pty, _PTY_READ_MAX_BYTES_FOR_TEST)
--> 275       decoded_contents = decoder.decode(raw_contents)
    276 
    277       sys.stdout.write(decoded_contents)

/usr/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfa in position 21: invalid start byte

As you can see, !git stash pop works but not !git status. Note that all git commands worked fine in my Colab until recently. It seems like something happened in the Colab that made it reject some git commands.

Yes, I've tried creating a new Colab and rewriting the offending git commands there, and the error is still present. I've also tried deleting every character and retyping the commands from scratch (in case there were hidden characters causing the issue).

On my local machine, the commands work fine and don't print any weird characters:

$ git status
On branch tf2
nothing to commit, working tree clean

Any thoughts?

Eric

2 Answers

First, read https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ so that the rest of this answer will make sense.

Now, the problem here is that you have told different subsystems—Python and Git, in this case—to use different encodings. Specifically, you have told Python to expect UTF-8 encoded data. Your Git is probably not using any particular encoding at all and just copying raw bytes through. Normally, your git status output would use an octal escape encoding (see, e.g., How to make Git properly display UTF-8 encoded pathnames in the console window? and Git gets confused with ä in file name). Other commands don't do any special encoding (see, e.g., Make git diff show UTF8 encoded characters properly) and git status can be told how to behave. The --porcelain=v2 and -z options are particularly useful in git status, though you must then rewrite your own Python code to expect byte sequences.
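
As a rough sketch of what "expect byte sequences" could look like in a notebook cell: the snippet below bypasses the `!` shell magic, runs `git status --porcelain=v2 -z` through `subprocess`, and keeps the output as bytes. The helper names (`git_status_bytes`, `to_printable`, `split_entries`) are made up for illustration, and `backslashreplace` is only one tolerant way to display bytes that are not valid UTF-8.

```python
import subprocess


def git_status_bytes(repo_dir="."):
    """Run `git status --porcelain=v2 -z` and return (exit code, stdout, stderr) as raw bytes."""
    result = subprocess.run(
        ["git", "status", "--porcelain=v2", "-z"],
        cwd=repo_dir,
        stdout=subprocess.PIPE,   # bytes, because no encoding/text option is set
        stderr=subprocess.PIPE,
        check=False,              # do not raise on a non-zero exit code
    )
    return result.returncode, result.stdout, result.stderr


def to_printable(raw):
    """Decode as UTF-8, showing undecodable bytes as \\xNN escapes instead of raising."""
    return raw.decode("utf-8", errors="backslashreplace")


def split_entries(raw):
    """Split NUL-separated output into printable fields.

    With -z, a rename record carries its original path as an extra
    NUL-separated field, so one record can span two fields here.
    """
    return [to_printable(field) for field in raw.split(b"\0") if field]
```

Calling `split_entries(git_status_bytes("/path/to/repo")[1])` (the path is a placeholder) would then list the status fields without ever raising `UnicodeDecodeError`, which makes the offending file name visible.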

You may need to figure out what the actual encoding used is, in the underlying file system. If you do not wish to deal with this sort of problem, make sure all your file names use simple ASCII characters: no files named schön, for instance.
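
One way to start narrowing down the encoding is to take the byte from the traceback (`0xfa`) and see how a few candidate codecs treat it. The candidate list below is a guess for illustration only; nothing in the question confirms which encoding, if any, produced the byte.

```python
# The byte reported in the traceback.
raw = b"\xfa"

# Candidate encodings to try. These are guesses, not known facts about the repo.
candidates = ["utf-8", "cp1252", "latin-1"]

results = {}
for name in candidates:
    try:
        results[name] = raw.decode(name)
    except UnicodeDecodeError as exc:
        # exc.reason holds the short explanation, e.g. "invalid start byte"
        results[name] = "error: " + exc.reason

# results["utf-8"] is "error: invalid start byte", matching the traceback;
# results["cp1252"] and results["latin-1"] are both the single character U+00FA.
```

A clean decode under some codec does not prove that codec was actually used; it only shows which interpretations are possible for that byte.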

torek
  • Thank you for the detailed response. Remember I'm not running this on my local machine, I'm running this on a [Google Colab](https://colab.research.google.com/) where I have very little control over the Python subsystem and where the git system is preinstalled. So I'm not quite sure how your answer is relevant. Also, why would some git commands work and not others? – Eric Dec 11 '20 at 11:24
  • If you have no control, you are SOL. If you have control, either use Python 2 (so that all the encoding and decoding stuff goes away) or change the code. You have some existing files whose names have accented or non-ASCII characters in them (`0xfa` in Windows CP1252 represents `ú`, in particular, although I have no idea if you're using CP1252 here). – torek Dec 11 '20 at 11:37
  • The commands that work are the ones that aren't printing this character. The commands that don't work are the ones that are. – torek Dec 11 '20 at 11:38
  • I tried running the commands on my local machine and they don't print any potentially offending characters (see my updated question). Also, I was able to change the runtime to Python 2, and I still get the same error. – Eric Dec 14 '20 at 11:24
  • Curious. Your traceback now shows python2 paths (rather than /usr/local/lib/python3.6)? – torek Dec 14 '20 at 11:43
  • Looks like the issue was elsewhere (corrupted git index file). See my answer. Thanks anyway for your help. – Eric Dec 14 '20 at 16:50
  • It's definitely worth figuring out *why* you had a corrupted index. (That's where the bad file name(s) came from, since they're stored in the index.) – torek Dec 14 '20 at 20:16

I tried manipulating the repo using GitPython, and got a different error:

Error: bad index – Fatal: index file corrupt

This indicates that the problem is a corrupt git index file. I solved it in my Colab by running the following from the repository root:

rm -f .git/index
git reset

Thanks to this answer.
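
For anyone who prefers to do this from a Python cell and keep a copy of the broken index for later inspection, here is a minimal sketch under some assumptions: the function names and the `.bak` suffix are my own choices, and `repo_dir` must point at the repository root. Note that `git reset` rebuilds the index from `HEAD`, so anything that was staged becomes unstaged (the working tree files are left alone).

```python
import os
import shutil
import subprocess


def backup_and_remove_index(repo_dir="."):
    """Copy .git/index to .git/index.bak (hypothetical name), then delete the original."""
    index_path = os.path.join(repo_dir, ".git", "index")
    backup_path = index_path + ".bak"
    if os.path.exists(index_path):
        shutil.copy2(index_path, backup_path)  # keep the corrupt file for inspection
        os.remove(index_path)                  # equivalent of `rm -f .git/index`
    return backup_path


def rebuild_index(repo_dir="."):
    """Run `git reset` so git recreates the index from HEAD."""
    return subprocess.run(
        ["git", "reset"],
        cwd=repo_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,  # inspect returncode/stderr instead of raising
    )
```

Call `backup_and_remove_index(repo_dir)` first, then `rebuild_index(repo_dir)`, and check the returned `returncode` and `stderr` before running further git commands.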

Eric