
I am trying to convince my colleague that using a subprocess to get a repo head is bad, because spawning a subprocess (creating a process) has a lot of overhead. To convince him, I created two scripts and profiled them, but the results were not what I expected (I expected python-git to be faster than subprocess).

This is the first script I profiled, test_git_module.py:

import git


def test():

    repo = git.Repo(".", search_parent_directories=True)

test()

After profiling this with cProfile (python3 -m cProfile test_git_module -s), the output I got was: 78059 function calls (75806 primitive calls) in 0.130 seconds

On the other hand, when I profiled the script test_subprocess.py, the output was: 6529 function calls (6430 primitive calls) in 0.017 seconds

test_subprocess.py

import subprocess
import os
import sys


def test():

    SELF_DIRPATH = os.path.dirname(__file__)
    WORKSPACE_DIRPATH = (
        subprocess.run(["git", "rev-parse", "--show-toplevel"], stdout=subprocess.PIPE, check=True)
        .stdout.decode(sys.stdout.encoding)
        .strip()
    )

test()

So in this case python-git clearly does not help; it is the slower option for this kind of task. This brings me to my question: when and why should anyone use python-git over a subprocess?
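One caveat on the numbers above: profiling a whole script with cProfile also counts the time spent importing modules, so part of the 0.130 seconds is probably the cost of import git rather than of the Repo call itself. A minimal sketch for timing only the call, averaged over a few runs, could look like this. The names time_call and spawn_noop are hypothetical helpers, not part of either script, and the child Python interpreter is only a stand-in for any short-lived process:

```python
import subprocess
import sys
import time


def time_call(fn, repeat=5):
    """Return the mean wall-clock seconds per call of fn over `repeat` runs."""
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return (time.perf_counter() - start) / repeat


def spawn_noop():
    # Start a child process that exits immediately. This is an assumption
    # used for illustration: it measures process start-up, not git itself.
    subprocess.run([sys.executable, "-c", "pass"], check=True)


if __name__ == "__main__":
    # Imports have already happened by this point, so they are not timed.
    print(f"spawn: {time_call(spawn_noop):.4f} s per call")
```

The same helper can be pointed at a function that calls git.Repo(...) and at one that calls subprocess.run(["git", ...]) to compare the two calls without the import cost.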

sid chawla
2 Answers

Using subprocess has distinct advantages.

  • The subprocess module is part of the standard library.
  • It is a pattern you will encounter very often; not every program has a Python module for it.
  • On modern (especially UNIX-like) systems, creating a process is fast and cheap.

As for parsing output: with git log it is not that hard to shape the output so that it is easy to parse:

git log --pretty=format:"%h%x09%an%x09%ad%x09%s"

(From this answer.) This produces every commit as a single line with the fields separated by tab characters, which is very easy to transform:

import subprocess as sp

args = ['git', 'log', '--pretty=format:%h%x09%an%x09%ad%x09%s']
commits = [ln.split('\t') for ln in sp.check_output(args, text=True).splitlines()]
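If named fields are more convenient than bare lists, the same idea can be sketched with a NamedTuple. This is only an illustration: Commit, parse_log and read_log are hypothetical names, and the field names are my own labels for %h, %an, %ad and %s.

```python
import subprocess as sp
from typing import NamedTuple

# %x09 in the format string is a tab character, used as the field separator.
LOG_FORMAT = "--pretty=format:%h%x09%an%x09%ad%x09%s"


class Commit(NamedTuple):
    short_hash: str  # %h
    author: str      # %an
    date: str        # %ad
    subject: str     # %s


def parse_log(text):
    """Turn tab-separated `git log` output into a list of Commit records."""
    commits = []
    for line in text.splitlines():
        if not line:
            continue
        # maxsplit=3 keeps any tab characters inside the subject intact.
        commits.append(Commit(*line.split("\t", 3)))
    return commits


def read_log(repo_dir="."):
    """Run git log in repo_dir (requires git and a repository) and parse it."""
    out = sp.check_output(["git", "log", LOG_FORMAT], cwd=repo_dir, text=True)
    return parse_log(out)
```

Keeping the parsing in its own function means it can be checked against a sample string without running git at all.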

Sure, there are other programs where processing the output is more difficult. However:

  • Text is a universal interface.
  • This is Python! Data transformation and processing is a core strength of the language.
Roland Smith

The git module was not created for execution speed. Calling shell commands and parsing their output makes your code unreadable and hard to maintain, and it can be tricky at times. Calling Python functions rather than subprocess.run is most often more elegant, readable, and convenient.

The output of git rev-parse --show-toplevel is simple to parse. How about git log? I'm not saying it can't be done, but 95% of your code will be about calling the shell and parsing the output rather than your logic. Obviously, you could create a function for each command you need, but that's what the git module already is.
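To make the "function for each command" point concrete, here is a rough sketch of what such hand-written wrappers tend to look like. The names git_output and repo_toplevel are hypothetical, and the run parameter exists only so the wrapper can be exercised without a real repository:

```python
import subprocess


def git_output(*args, cwd=None, run=subprocess.run):
    """Run `git <args>` and return its standard output as stripped text."""
    result = run(
        ["git", *args],
        cwd=cwd,
        stdout=subprocess.PIPE,
        check=True,   # raise CalledProcessError on a non-zero exit status
        text=True,    # decode stdout to str instead of returning bytes
    )
    return result.stdout.strip()


def repo_toplevel(cwd=None, run=subprocess.run):
    """Return the top-level directory of the repository containing cwd."""
    return git_output("rev-parse", "--show-toplevel", cwd=cwd, run=run)
```

Every additional command needs another wrapper like repo_toplevel, plus its own output parsing, which is the maintenance cost described above.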

It's just like ORM vs bare SQL queries. Most devs prefer ORM for the convenience.

RafalS
  • As a counterpoint: I actually prefer using SQL queries. For example, they allow you to consider which data transformation is best done in the query or in Python. – Roland Smith May 02 '20 at 08:57