So, I am working on a GitHub repo that I plan to publish as a Python package. I accidentally added and committed a couple of large data files (for testing) to the repository, then removed them in a later commit.

My main concern is that these data files are very large.

My question is two-fold:

  1. If someone clones my repo now, will the size of the download include the large data files, because git keeps them available locally in case someone reverts to an older commit?
  2. When I publish this as a Python package, will the installer similarly download these large data files (regardless of whether they are referenced or not)?

I suspect the answer to 1) is Yes, but to 2) is No, but I am not sure. If either answer is Yes, how do I fix this?

  • Your suspicion is correct: git will have the files, Python packages probably won't (if built from a commit that doesn't specifically include them). To remove them fully, you'll need to rewrite the git history (which can lead to problems, but is doable). See [this question](https://stackoverflow.com/questions/43762338/how-to-remove-file-from-git-history) for details on removing files from the git history. If the files were only added in recent commits, then a rebase with a force-push is probably the easiest solution. – Joachim Sauer Jun 16 '21 at 15:00
  • To add to the previous comment - the problems are mostly if someone has already cloned the repository; they'll need to reset or rebase their branches after the rewriting. This gets increasingly problematic as the number of people involved grows... If the only person using the repo is you and nobody else, there's no problem rewriting history. – Jiří Baum Jun 16 '21 at 15:03
  • I would follow the process of rewriting your commit history. This is considered a "dangerous" move, but likely appropriate in this case (to remove the large files from the history of the repo). If a large number of developers are using that repo's branch, it may not be an option. But with a small number of developers it should be fairly doable. It will require a `git push -f` after rewriting history. – benhorgen Jun 16 '21 at 15:45

1 Answer

  1. Yes, cloning will download all deleted files.

By default, git clone downloads the complete repository, including every version of every file that was ever committed. To truly scrub away the old files you need to remove them from the history using a tool such as the BFG Repo-Cleaner. See GitHub's help article Removing sensitive data from a repository.

In the future, you can store large files in your repository without bloating the repository size using a tool called Git Large File Storage (git-lfs). See GitHub's help article Managing large files for how to use it.
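Before rewriting history, it can help to confirm which objects are actually taking up the space. Here is a rough sketch in Python that wraps two standard git commands (`git rev-list --objects --all` and `git cat-file --batch-check`). The function names `parse_batch_check` and `largest_blobs` are my own, and it assumes `git` is on your PATH and that you are fine inspecting every ref in the repository.

```python
import subprocess


def parse_batch_check(output):
    """Turn `git cat-file --batch-check` output into (size, sha, path) tuples.

    Expects lines shaped like "<type> <sha> <size> <path>", which is the
    format requested in largest_blobs() below. Only blobs (file contents)
    are kept; commits and trees are skipped.
    """
    blobs = []
    for line in output.splitlines():
        parts = line.split(" ", 3)  # the path may itself contain spaces
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append((int(parts[2]), parts[1], parts[3]))
    return sorted(blobs, reverse=True)  # largest first


def largest_blobs(repo=".", top=10):
    """Return the `top` largest blobs reachable from any ref in `repo`."""
    # List every object in the history, with its path where it has one.
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Ask git for each object's type and size; %(rest) echoes the path back.
    checked = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(checked)[:top]
```

Calling `largest_blobs()` from the repository root should list the deleted data files near the top if they are still in the history. After a successful cleanup and garbage collection they should no longer appear.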

  2. Depends on how your installer works.

If it does a git clone of the whole repository, then yes. But if it is smart about it, it will fetch only the latest version (for example, with a shallow clone such as `git clone --depth 1`).
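If you build the package into a wheel or sdist, you can check directly whether the data files ended up inside it. The sketch below uses only the standard library and relies on wheels being zip archives and sdists being tar archives. The function name `archive_members` and the example path in the note afterwards are hypothetical.

```python
import tarfile
import zipfile


def archive_members(path):
    """Return (size_in_bytes, name) for each file in a wheel or sdist, largest first."""
    if zipfile.is_zipfile(path):
        # Wheels (.whl) are ordinary zip archives.
        with zipfile.ZipFile(path) as zf:
            members = [(info.file_size, info.filename) for info in zf.infolist()]
    else:
        # Source distributions are usually .tar.gz archives.
        with tarfile.open(path) as tf:
            members = [(m.size, m.name) for m in tf.getmembers() if m.isfile()]
    return sorted(members, reverse=True)
```

For example, `archive_members("dist/yourpackage-0.1-py3-none-any.whl")[:5]` would show the five largest files in a wheel at that (made-up) path. If the large data files are not listed, an installer that downloads that archive will not fetch them.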

Schwern