You've tagged this with git, but you said:
that merge request was merged to master later ...
Git itself does not have anything called "merge requests" (nor "pull requests" for that matter). Git does have merges and the git merge command (and also the git request-pull command, but this merely makes text suitable for sending as part of an email message: it does no merging). GitLab offers something called Merge Requests, and other hosting sites offer things called Pull Requests, but we have to guess what you mean here since you have not mentioned GitLab.
What we can say, about Git in general, is this:
- If the commit(s) with the large files in them exist, they need space for those files.
- If those commit(s) no longer exist, they don't need space for those files.
So the Git answer to whether these:
forever increase the size of the repository
depends on whether those commits still exist. Answering that question is where things get complicated, because MRs and PRs and hosting sites add wrinkles that aren't there in plain Git.
You added this to that last bit:
(master branch).
The branch name isn't really important. Git isn't about branches; it's about commits. We organize our commits into things we call "branches"—the word branch is badly overused in Git, almost to the point where it becomes meaningless, but the collections-of-commits form "branches"—and we use branch names to find specific important commits, by which we can find the earlier commits that make up these collections that we call "branches". But at the end of the day, it's really all about the commits.
Each Git commit:
- Is numbered: it has some big ugly hash ID (e.g., a123456 when shortened). This hash ID (more formally, object ID or OID) is literally how Git finds the commit. Git needs the OID to locate the commit inside its big database of all-of-its-Git-objects.
- Is read-only: no part of any commit can ever be changed.
- Stores two things: a full snapshot of every file, plus some metadata (information about the commit itself).
The files stored in the commit are stored in a special, read-only, Git-only, compressed and de-duplicated format, so that if there are a million commits but all of them just reuse one file over and over again, there's only one copy of that file in the repository. Commits can share files like this because the stored objects are all read-only: since it's impossible to overwrite an object, you can't change the stored file, so it's safe for all the commits to share it. But as long as some commit, somewhere in the objects database, has some big file, the repository has the file in it, and therefore that file takes (some) space.1
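Here is a minimal sketch of why identical files get stored only once. It assumes a repository using the default SHA-1 object format, where a file's ID is the hash of a short "blob" header plus the file's bytes. The ToyObjectStore class is a hypothetical stand-in for Git's object database, not how Git actually lays things out on disk, and it ignores compression entirely.

```python
import hashlib


def git_blob_oid(content: bytes) -> str:
    """Compute the object ID a SHA-1 Git repository gives a file (blob)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ToyObjectStore:
    """Hypothetical in-memory stand-in for Git's object database."""

    def __init__(self):
        self.objects = {}  # maps OID -> file content

    def add(self, content: bytes) -> str:
        oid = git_blob_oid(content)
        # Identical content always hashes to the same OID, so adding it
        # again stores nothing new.
        self.objects.setdefault(oid, content)
        return oid


store = ToyObjectStore()
big_file = b"x" * 1_000_000

# Three "commits" that all contain the same big file share one stored copy.
oids = [store.add(big_file) for _ in range(3)]
print(len(set(oids)), len(store.objects))  # prints: 1 1
```

The flip side is what matters for your question: that one copy stays in the store for as long as any commit still refers to its ID.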
The metadata in any given commit includes, for Git's own use, the raw hash ID of the parent (or parents, plural) of that commit. This parent linkage forms a backwards-looking chain of commits, which is how "branches" actually work: the branch name gives the raw hash ID of the latest commit, and from that commit, Git works backwards, one hop at a time, from a child commit to its parent. That parent is itself a child of another parent (the "grandparent" of the latest commit), and that parent has another parent, and so on. Traversing this list, backwards, one hop at a time, finds the history in the repository; the commits are the history in the repository, chained together (backwards) via this metadata.
The upshot of all of that is straightforward enough: if you have the latest commit in a Git repository, you also have every earlier commit. That's because a Git commit automatically brings with it its parent, which automatically brings with it another parent, and so on, all the way back to the start of history.2 This is what perivesta means by this comment.
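The backwards walk can be sketched with a toy graph. The parents dictionary and the history function below are made up for illustration, and commit F is treated as the very first commit to keep the example short. Real Git does the equivalent walk over commit objects in its database.

```python
# Toy commit graph: each commit ID maps to the list of its parents' IDs.
parents = {
    "F": [],      # pretend F is the very first commit
    "G": ["F"],
    "H": ["G"],
}

# A branch name just holds the ID of the latest commit.
branches = {"somebranch": "H"}


def history(tip, parents):
    """Return every commit found by walking backwards from tip."""
    seen = []
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.append(commit)
        stack.extend(parents[commit])  # hop from child to parent(s)
    return seen


print(history(branches["somebranch"], parents))  # prints: ['H', 'G', 'F']
```

Nothing in the walk ever goes forwards: a commit knows its parents, never its children.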
1. The amount of space needed for a file depends heavily on how well it compresses. Git has two different kinds of compression, too. As a general rule, these compression techniques work really well on source files made up of human-readable text, not very well on build products, and not at all on pre-compressed files like JPG images or zipped archives or whatever. But that's a separate issue.
2. A special kind of clone, a so-called shallow clone, omits certain parents at "shallow graft points". Shallow clones have certain restrictions, so we normally use full (non-shallow) clones. The repository size you're asking about is that of a non-shallow clone in the first place, so we get to ignore this special case.
How we get rid of commits
A Git repository is normally built up over time by adding commits, one by one, like bricks making up a building. We never remove a commit: we just add a new one. Your "commit build products, remove build products, commit again" process made two commits, one of which holds the build products and one of which doesn't. The second of these commits requires the first one because of the way Git stores commits and metadata.
But sometimes we really do want to remove a commit. We can do that—sort of—using the fact that a branch name is defined as "contains the hash ID of the latest commit". When we add commits to a repository, Git works like this:
... <-F <-G <-H <--somebranch
Here, the name somebranch points to (contains the raw hash ID of) the currently-latest commit H (the letter H stands in for some actual, random-looking, big ugly hash ID). Commit H points backwards to its parent G (the metadata for H contains G's hash ID), G points backwards to F, and so on.
When we make a new commit using commit H as the current commit, Git stores a new snapshot-and-metadata and sets it up so that the new commit points backwards to H. Then Git stores this new commit's hash ID into the branch name:
... <-F <-G <-H <-I <--somebranch
where I is our new commit.
To get rid of commit I, and commit H too, we can tell Git that it should store the raw hash ID of commit G into the name somebranch. When we do that, we get:
          H--I   ???
         /
...--F--G   <-- somebranch
The name somebranch now points to G, which continues (forever!) to point backwards to F. Commits H-I are still in the database but nobody can find them.
Once we do this kind of thing, Git will eventually drop commits H-I entirely. Precisely when and why is complicated, but the key is that there must be no names by which we can find the hash IDs. As long as no name lets us find commit I, commit I will eventually get tossed out. As long as commit I is being tossed out and it's the only way to find commit H, commit H will eventually get tossed out too. Commit G, on the other hand, is easily found: the name somebranch finds it. Commit F is easily found by starting with the name somebranch, which finds G, and working backwards. So commits up through and including G won't be tossed out.
This means that we get rid of commits, in Git, by re-arranging our branch names (and other names as needed: tag names, remote-tracking names, and any other names that might exist in the repository) such that they don't find the commit. Because commits find earlier commits, though, it's absolutely necessary that we strip away all commits after some point, in order to strip away that particular commit. To get rid of commit H we must get rid of commit I too!
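As a rough sketch of that rule, here is a toy model in which a commit survives only if some name can reach it. The reachable function and the tag name v1.0 are made up for illustration. Real Git also consults reflogs and grace periods before it actually deletes anything, which this model ignores.

```python
def reachable(names, parents):
    """Collect every commit that can be found by starting from any name."""
    found = set()
    stack = list(names.values())
    while stack:
        commit = stack.pop()
        if commit not in found:
            found.add(commit)
            stack.extend(parents[commit])
    return found


# Toy graph: F <- G <- H <- I, with F treated as the first commit.
parents = {"F": [], "G": ["F"], "H": ["G"], "I": ["H"]}
names = {"somebranch": "I"}

# Move the branch name back to G: H and I can no longer be found.
names["somebranch"] = "G"
print(sorted(set(parents) - reachable(names, parents)))  # prints: ['H', 'I']

# Any other name (here a hypothetical tag) keeps its commit, and
# everything behind that commit, alive.
names["v1.0"] = "H"
print(sorted(set(parents) - reachable(names, parents)))  # prints: ['I']
```

Note that there is no way to make H unreachable while keeping I reachable, because I's parent link always leads back to H.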
Now we can talk about the mechanics involved: what commands we use in Git to do this sort of thing.
git reset and git branch -f
If we just want to move some branch name, we can tell Git to move the branch name. There are two commands that do this:
- git branch -f will let you take any branch name and make it point to any commit. You just supply the name and the hash ID. But there's one restriction: you can't have this branch checked out. If you do have the branch checked out, you need the other command.
- git reset will let you move the current branch name to any commit. You just supply the hash ID. The current branch moves to point to that hash ID. You must, however, choose what Git should do with Git's index and your working tree. (This gets complicated but I'll skip all the details.)
The negative part of doing this, of course, is that we lose all the commits after the one we want to drop. If commit H is bad, but commit I is good, and we use git reset or git branch -f to drop commit H, we lose commit I too.
Rebase
The git rebase command, especially in its interactive rebase form, lets us re-arrange commits however we like. Now, we already know that it's impossible to change any commit, so that's not what rebase does. Instead, rebase works by copying commits. A rebase operation essentially runs multiple git cherry-pick commands, with each cherry-pick being a single-commit-copy step. So rebase automates these copies. But it also adds a first and last step.
Suppose we have these commits:
...--F--G--H--I   <-- somebranch
and we'd like to drop commit G entirely, but keep H and I. To do that, we need to copy H and I to new-and-improved commits, which we'll call H' and I'. These new commits will have new snapshots and new metadata: new commit H' will use commit F as its parent, and new commit I' will use H' (the copy of H) as its parent, like this:
       H'-I'
      /
...--F--G--H--I   <-- somebranch
Once we've made these two copies, git branch -f (or an equivalent git reset) can move the name somebranch to point to I':
       H'-I'   <-- somebranch
      /
...--F--G--H--I   ???
which we can draw this way:
...--F--H'-I'   <-- somebranch
      \
       G--H--I   ???
Git will now (well, eventually) discard the entire G-H-I chain, since there's no way to find it. When we look at history, we see commits that look exactly like the original H and I (they make the same changes as H and I did, and have the same log messages), but they have different hash IDs.
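To see why a copy necessarily gets a new hash ID, here is a toy model. It is a simplification: the Commit class and cherry_pick function are invented for illustration, a "change" string stands in for the snapshot, and real commits also hash author, committer, and timestamp data. The point it tries to show is only that the parent's ID is part of what gets hashed.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a commit can never be modified
class Commit:
    parent: str   # parent commit ID ("" for the first commit)
    change: str   # simplified stand-in for the commit's snapshot
    message: str

    @property
    def oid(self) -> str:
        # The ID covers everything in the commit, including the parent ID.
        data = f"{self.parent}\n{self.change}\n{self.message}".encode()
        return hashlib.sha1(data).hexdigest()


def cherry_pick(commit: Commit, new_parent: Commit) -> Commit:
    """Copy a commit onto a new parent: same change and message, new ID."""
    return Commit(new_parent.oid, commit.change, commit.message)


F = Commit("", "add f.txt", "F")
G = Commit(F.oid, "add build products", "G")
H = Commit(G.oid, "add h.txt", "H")
I = Commit(H.oid, "add i.txt", "I")

# "Rebase" H and I onto F, skipping G, then move the branch name.
H2 = cherry_pick(H, F)
I2 = cherry_pick(I, H2)
branch = {"somebranch": I2.oid}

print(H.oid[:7], "->", H2.oid[:7])
print(I.oid[:7], "->", I2.oid[:7])
```

The originals G, H, and I are untouched by all of this. They just have no name pointing at them any more.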
Merge vs squash-merge
Besides rebasing commits away, there's one other trick we may often use called a squash merge. This one is workflow-dependent: some people love squash merges, some people hate them, some are kind of indifferent, and a very small number of people actually know what they're doing and use them exactly when they're appropriate.
Suppose we have a chain of commits like this:
...--G--H <-- main
We now develop a new feature or two. I'll draw two features, br1 and br2, for now, which will force a true (non-squash, non-fast-forward) merge:
          I--J   <-- br1
         /
...--G--H   <-- main
         \
          K--L   <-- br2
We now decide that we'd like to merge the two feature branches, so we run:
git switch br1 && git merge br2
to merge br2 into br1. The result, if all goes well, looks like this:
          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L   <-- br2
(main still exists and still points to H; I just couldn't draw it any more.)
We can now delete the name br2, make the name main point to commit M, and then delete the name br1 as well:
          I--J
         /    \
...--G--H      M   <-- main (HEAD)
         \    /
          K--L
We don't need the other two branch names any more: commit M, which has two parents, lets us find commits J and L both, and those let us find I and K, which let us find H, and so on. So all the history remains intact, and all of it is find-able.
More typically, we might merge one feature branch into main like this:
          I--J   <-- feature
         /    \
...--G--H------M   <-- main (HEAD)
(Doing this in command-line Git requires a flag, git merge --no-ff, to prevent Git from doing a fast-forward instead of a merge. GitHub's green MERGE button always makes a true merge, as if you always specified --no-ff. GitLab is probably fancier; I have not used GitLab.)
When we do this kind of merge, we can now delete the name feature, just as with any other merge. Commits I-J remain findable because of the merge commit.
But instead of a true merge, we can run git merge --squash. This makes a new commit that we can draw this way:
          I--J   <-- feature
         /
...--G--H------S   <-- main (HEAD)
New commit S, our squash-merge commit, has only one parent: H. The snapshot in S is the same as it would be for merge commit M, but S does not connect back to the feature branch tip commit. So if we now delete the name feature, we get:
          I--J   ???
         /
...--G--H--S   <-- main (HEAD)
Commits I and J will now eventually go away!
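The difference between the two kinds of merge comes down to one parent link, which a toy graph can show. As before, the reachable function and the dictionaries are invented for illustration, G is treated as the first commit, and reflogs and grace periods are ignored.

```python
def reachable(tips, parents):
    """Collect every commit that can be found by starting from the tips."""
    found = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit not in found:
            found.add(commit)
            stack.extend(parents[commit])
    return found


# Shared history: G <- H on main, then I <- J on the feature branch.
base = {"G": [], "H": ["G"], "I": ["H"], "J": ["I"]}

# True merge: M records both H and J as parents.
merged = dict(base, M=["H", "J"])
# Squash merge: S records only H as its parent.
squashed = dict(base, S=["H"])

# In both cases the name feature has been deleted; only main remains.
lost_after_merge = set(merged) - reachable(["M"], merged)
lost_after_squash = set(squashed) - reachable(["S"], squashed)

print(sorted(lost_after_merge))   # prints: []
print(sorted(lost_after_squash))  # prints: ['I', 'J']
```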
Hence a squash merge might have discarded the build files
If you used a squash merge, rather than a real merge, the original sequence of commits (where you added and then deleted the build files) is no longer relevant on the main line (master or main branch). If you've also deleted any names by which you could find those commits, the originals may well be gone by now.
The process of removing stale, unreferenced commits is called garbage collection. You can force a garbage collection with git gc. However, all commits get a certain grace period (by default at least 14 days) so even if commits are "ready to be removed", it can still take 14 days for a git gc to remove them. You can speed this up with git gc --prune=now, but this may not help if there are reflog entries, which generally protect commits for at least 30 days. You can get rid of reflog entries with git reflog expire --expire=now --all, but this removes all safety backup information, so it's generally not a good idea. Mostly you should just let Git do this on its own.
Hosting sites throw in extra wrinkles
When you use a hosting site like GitHub or GitLab and make Pull Requests or Merge Requests, you're also using their add-on databases. These add-on databases may refer to commits, and in order to do so, they may add their own secret (hidden / internal) names for commits that keep them alive.
These kinds of names may prevent some or all commits from ever being garbage collected. Whether, when, and how you can get around this depends on the hosting site.