
I am building a Python package. It consists of some scripts and several data files (~500 MB in total) stored as small CSV files. I use setuptools and I track the development of the package on GitLab.

From time to time, I need to update the CSV files. Crucially, I always replace all of them at the same time. The problem is that every time I do so, the size of the GitLab repo and of the Python package increases enormously, because git keeps every version of the files.

I was wondering if you have some suggestion on the best practices in such a case and how to keep the package reasonably small in particular. Is git lfs the best option?

sinoroc
user64898
  • That the size of a git repository increases with every single commit, is perfectly normal, expected, and out of your control. But that the size of the Python package increases with every release is only on you and your project. -- Are you by chance (by mistake) adding the `.git` directory into the sdist and/or wheel distributions of your project? – sinoroc Oct 18 '22 at 10:15
  • But when I install the package with `pip install git+ssh://git@myrepo/reponame.git` it downloads the entire (big) repo, right? This takes a really long time. – user64898 Oct 18 '22 at 11:30
  • I do not know for sure. It seems like [`pip` does not do a "shallow" clone](https://stackoverflow.com/a/52989760), if it is indeed the case, then I guess that yes `pip` probably downloads the entire repository. -- Installing from `git` should definitely not be the preferred way of installing things. -- In your case I guess I would look into finding a way to keep the data files separately from the code and work on better packaging and distribution processes. – sinoroc Oct 18 '22 at 12:17

1 Answer


> repo and the python package increases insanely, because git keeps version control of the files

This reasoning is partly wrong, because:

  1. The package only has to include the HEAD (?) version of the CSV files, so the repository history should have minimal impact on the package size.
  2. The repository does grow with every commit, but Git stores only compressed deltas in its changesets (as a short, simplified explanation this is true, without diving into implementation details), and deltas of text files (and CSV files should be text) are small, even if the total size of the CSV files is noticeable.
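As a rough illustration of the second point, the sketch below compares the compressed size of a whole CSV file with the compressed size of a diff between two versions. It is only an analogy: `difflib` and `zlib` stand in for Git's own packing, which works differently in detail, and the helper names (`make_csv`, `compressed_size`, `delta_size`) and the generated data are made up for this example. It also suggests that how small a delta is depends on how much of the content actually changes between versions.

```python
import difflib
import zlib


def make_csv(rows):
    """Build a small CSV document from (id, value) pairs (synthetic data)."""
    lines = ["id,value"] + [f"{i},{v}" for i, v in rows]
    return "\n".join(lines) + "\n"


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8")))


def delta_size(old, new):
    """Compressed size of a unified diff between two versions.

    This is a stand-in for a stored delta, not Git's actual format.
    """
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        n=0,  # no context lines, only the changes themselves
    )
    return compressed_size("".join(diff))


old = make_csv((i, i * 2) for i in range(1000))
# Same file with a single changed row.
small_change = make_csv((i, -1 if i == 500 else i * 2) for i in range(1000))
# Same ids, but every value replaced.
full_rewrite = make_csv((i, i * 7 + 3) for i in range(1000))

print("whole file, compressed:  ", compressed_size(old))
print("delta, one row changed:  ", delta_size(old, small_change))
print("delta, all rows changed: ", delta_size(old, full_rewrite))
```

Running it on your own CSV files (two consecutive versions of the same file) gives a quick feel for whether your updates are closer to the "one row changed" or the "all rows changed" case.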

Hints:

  • Check the content of the built package and exclude unwanted artifacts
  • Check the Content-Type of the *.csv files as detected locally, in your own environment (they should be treated as text)
  • Git-LFS can decrease the size of the repository (after a correct migration), but it does not affect the size of the package
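
One way to follow the first hint is to list what actually ends up inside the built distributions. The sketch below uses only the standard library (`zipfile` for wheels, `tarfile` for sdists). It assumes the archives were built into a `dist/` directory, which is the usual default but may differ in your setup; the function names `list_members` and `report` are made up for this example.

```python
import tarfile
import zipfile
from pathlib import Path


def list_members(archive):
    """Return (name, size_in_bytes) pairs for a wheel (.whl) or an sdist (.tar.gz)."""
    archive = Path(archive)
    if archive.suffix in {".whl", ".zip"}:
        with zipfile.ZipFile(archive) as zf:
            return [(info.filename, info.file_size) for info in zf.infolist()]
    with tarfile.open(archive) as tf:
        return [(m.name, m.size) for m in tf.getmembers() if m.isfile()]


def report(archive, top=10):
    """Print the largest files in the archive and return suspicious entries."""
    members = sorted(list_members(archive), key=lambda item: item[1], reverse=True)
    total = sum(size for _, size in members)
    print(f"{archive}: {len(members)} files, {total / 1e6:.1f} MB uncompressed")
    for name, size in members[:top]:
        print(f"{size / 1e6:10.2f} MB  {name}")
    # Anything under a .git directory should not be in a distribution.
    suspicious = [name for name, _ in members if ".git" in name.split("/")]
    for name in suspicious:
        print(f"unexpected entry: {name}")
    return suspicious


if __name__ == "__main__":
    # Assumption: distributions were built into ./dist
    dist = Path("dist")
    for path in sorted([*dist.glob("*.whl"), *dist.glob("*.tar.gz")]):
        report(path)
```

If the total reported here grows from release to release while the CSV data itself does not, something other than the HEAD version of the files is being packaged.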
Lazy Badger