2

I want to remove all the files in a branch that a specific user did not edit.

What's the most robust way to do that? I was hoping there would be a git command for it but I'm thinking I might have to write a program.

jbrahy
  • What do you mean "remove all the files in a branch": do you want to remove *all modifications to those files and keep them untouched as if they were never changed in the branch* or do you want to *delete those files and commit that deletion*? – Joachim Sauer Feb 01 '23 at 19:31
  • @JoachimSauer, I guess "all modifications to those files and keep them untouched as if they were never changed in the branch" is not an option, b/c jbrahy wants to delete files which were **not** edited by a user. – kosist Feb 01 '23 at 19:44
  • @kosist: it's possible if multiple users committed on that branch. They said "a specific user" not "a user". – Joachim Sauer Feb 01 '23 at 19:50
  • I fear you must assemble sth akin to [Get all files that have been modified in git branch](https://stackoverflow.com/a/10641810/2375855) + [Remove all files except some from a directory](https://stackoverflow.com/a/4325357/2375855) – ojdo Feb 01 '23 at 20:03
  • @kosist you have it right. I have a branch and I need to find all the files that were touched by a user and remove all the other files in the branch. – jbrahy Feb 01 '23 at 20:34
  • Are you wishing to rewrite history or just make a new commit that deletes some files that match your criteria? – TTT Feb 01 '23 at 23:34
  • A new commit with only the specific files that this user touched is what I'm looking for. – jbrahy Feb 02 '23 at 01:07
  • I'm thinking something along the lines of `git log --name-only --author=User` but that won't aggregate the files for you; you'd need to build the set per commit. Also, you might need to think about how you want to handle merge commits. If the user merged in a bunch of commits that were authored by someone else, should the files touched in those commits be included or not? (Because the merge commit "touched" them...) – TTT Feb 02 '23 at 05:51
  • I need to include any file the user edited. Not just merged but changed contents between commits. – jbrahy Feb 02 '23 at 20:24
  • Files aren't "in" branches, they're in commits. Commits aren't "in" any particular branch, branch names are repo-local temporary labels on specific commits. So "remove files from a branch" doesn't express any particular concrete meaning, it's a couldn't-be-more-vague general characterization that could fit so many possibilities it's hard to know even where to start asking or what specifically you're getting at. – jthill Feb 04 '23 at 00:55
  • I'm talking about all the files in all the commits in a branch. – jbrahy Feb 04 '23 at 11:31
  • Please edit the question to explain why. Generally, please **always** update a question to address comments. This sounds like the middle of a conversation which started earlier along the lines of "My pull request has other people's stuff in it" or similar. – AD7six Feb 05 '23 at 15:00

2 Answers

4

Disclaimer:

  1. Git does not record the history of an individual file. Events such as "this file was moved" can only be inferred after the fact by comparing file contents between commits, so you may have trouble attributing a moved file to its correct author.

  2. In standard Git, a commit is linked to two users: one author and one committer. They can differ. For example, if Lisa ran a rebase and moved commits authored by Franck, those commits would have Franck as the author and Lisa as the committer.

  3. Some platforms, such as GitHub, also have a way to represent co-authors (a "Co-authored-by: name <email>" line in the commit message).

     Depending on what you mean by "edited by a user", you may want to use any combination of these fields to identify the commits that person worked on.

  4. It is quite common to rework history (regroup commits, split them differently) before pushing to the production branch, so the list of authors and committers in your repo history may not accurately indicate who really edited those files.
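To see how the author and committer fields differ in your own history before deciding which one to filter on, a small sketch like the following may help. `show_people` is a hypothetical helper name, not a Git command; it only wraps `git log` with a format string, and any extra arguments are passed through to `git log`.

```shell
#!/bin/sh
# show_people (hypothetical helper): print the author and committer of each commit.
# %h = short hash, %an = author name, %cn = committer name, %s = subject line.
show_people() {
  git log "$@" --format='%h  author: %an  committer: %cn  %s'
}

# Example usage (the branch name is an assumption):
#   show_people main..HEAD
```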


You can do it in two steps:

Step 1: extract a list of files edited by the user

Here is one way to list all the files that were added or modified by a given author in your repo (--all searches every ref, not only the current branch):

$ git log --author="<NAME OR EMAIL>" --pretty="format:" \
          --diff-filter=AM --name-only --all | sort -u

You can store that list of files on disk: $ git log --author... | sort -u > /tmp/authored.txt

Step 2: once you have the list of files to keep, you can use git filter-repo (a separate tool that has to be installed) to rewrite the history so that it contains only these files

# work on a fresh clone of your repo:
git clone repo myclone
cd myclone

git filter-repo --paths-from-file /tmp/authored.txt

# the history of 'myclone' now contains only the files listed in /tmp/authored.txt
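The asker clarified in the comments that a new commit is enough, whereas git filter-repo rewrites history. As an alternative, here is a minimal sketch that stages the removal of every tracked file not in the list, leaving history untouched. `remove_unlisted` is a hypothetical helper name, and the sketch assumes plain file names (no newlines or characters that Git would quote in its output).

```shell
#!/bin/sh
# remove_unlisted (hypothetical helper): stage the removal of every tracked file
# that is NOT listed in the given file, which holds one path per line.
remove_unlisted() {
  keep_list=$1
  sorted_keep=$(mktemp)
  sort -u "$keep_list" > "$sorted_keep"
  # comm -23 prints lines found only in the first input: tracked but not kept.
  git ls-files | sort | comm -23 - "$sorted_keep" |
  while IFS= read -r path; do
    git rm --quiet -- "$path"
  done
  rm -f "$sorted_keep"
}

# Example usage, followed by a normal commit:
#   remove_unlisted /tmp/authored.txt
#   git commit -m "Keep only the files edited by the user"
```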

Further points:

As said in the "disclaimer" section, depending on what you mean by "edited", you may want to list more files in "Step 1":

  • git log --committer="<NAME>"
  • git log --grep="Co-Authored-by:.*[Ff]red"

Note that you can run as many different git log commands as you want to extract file names; you can always sort -u the combined result at the end.
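Combining several git log queries could be sketched as follows. The helper names `list_touched` and `files_for_user` are hypothetical, and the sketch only looks at the history reachable from the current HEAD (add --all inside `list_touched` if you want every ref, as in Step 1).

```shell
#!/bin/sh
# list_touched (hypothetical helper): print the paths added or modified by the
# commits that match the given git log options, one path per line.
list_touched() {
  git log "$@" --pretty="format:" --diff-filter=AM --name-only | sed '/^$/d'
}

# files_for_user (hypothetical helper): union of the files touched by commits
# where the name appears as author, committer, or in a Co-authored-by line.
files_for_user() {
  name=$1
  {
    list_touched --author="$name"
    list_touched --committer="$name"
    list_touched -i --grep="Co-authored-by:.*$name"
  } | sort -u
}

# Example usage:
#   files_for_user "Fred" > /tmp/authored.txt
```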

This solution does not try to be smart about renames. Please update your question if you have an explicit need for that.

LeGEC
3

Try this:

#!/bin/sh

# Set the user name (first name only). To check how the name appears, run git blame on a file.
user="<username>"

# File extension of the files you want to check
file_filter=".java"

# Loop through each tracked file matching the filter. git ls-files avoids picking up
# the .git directory and untracked files; reading line by line keeps paths with spaces intact.
git ls-files -- "*$file_filter" | while IFS= read -r file; do
  # Skip files that git cannot blame (for example, files not committed yet)
  git blame -- "$file" >/dev/null 2>&1 || continue

  # Use git blame to check whether the user authored any line of the file
  if git blame --line-porcelain -- "$file" | grep -q "^author .*$user"; then
    echo "Edited by user - $file"
  else
    # The user did not author any line in the file, so remove it
    echo "Not edited by user - $file"
    git rm -- "$file"
  fi
done

If you only want to consider the files that were changed in this branch, you can do it as below:

#!/bin/sh

# Set the user name (first name only). To check how the name appears, run git blame on a file.
user="<user>"

# Only consider files changed on this branch since it diverged from the main branch
main_branch="<main branch>"
merge_base_commit=$(git merge-base HEAD "$main_branch")

# Loop through each changed file. --diff-filter=d skips files that were deleted on the
# branch, since they can be neither blamed nor removed.
git diff --name-only --diff-filter=d "$merge_base_commit" HEAD | while IFS= read -r file; do
  # Use git blame to check whether the user authored any line of the file
  if git blame --line-porcelain -- "$file" | grep -q "^author .*$user"; then
    echo "Edited by user - $file"
  else
    # The user did not author any line in the file, so remove it
    echo "Not edited by user - $file"
    git rm -- "$file"
  fi
done
Chandika
  • You can limit some of the logic/churn here by 1) determining the merge-base commit between the branch and master/main/whatever 2) identifying files modified since that commit, deleting everything not mentioned 3) then looping through only files modified in the branch. Pointing this out because as is it'll not-delete files that the user touched at any time in their history. Pro tip: make the script output `git rm` as text - then you can iterate on it without destroying your repo, and _then_ execute the output once you're confident it does the right thing :). +1 anyway – AD7six Feb 05 '23 at 15:06
  • yes, you are right, I did not think on that perspective, I'll change the script if need to delete the changed files only – Chandika Feb 05 '23 at 15:33
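
The dry-run tip from the comment above (print the `git rm` commands as text, review them, then execute them) could be sketched like this. `print_rm_commands` is a hypothetical helper that reads paths from standard input; it assumes paths do not contain single quotes or newlines.

```shell
#!/bin/sh
# print_rm_commands (hypothetical helper): read one path per line from stdin and
# print the matching "git rm" command instead of running it.
print_rm_commands() {
  while IFS= read -r path; do
    # Single-quote each path so the printed command survives spaces in names.
    printf "git rm -- '%s'\n" "$path"
  done
}

# Example usage (the file names are assumptions):
#   git ls-files -- '*.java' | print_rm_commands > /tmp/rm-commands.sh
#   # review /tmp/rm-commands.sh, then run it:
#   sh /tmp/rm-commands.sh
```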