I have a local repo with a number of commits, let's say 10. The first (initial) commit contains all the files in the local repository and shows the full state of each file (the raw data of each file). I have a requirement to remove potentially sensitive data from Jupyter Notebooks, and I am doing so with a pre-commit hook that leverages the nbconvert Python package. This cleans future commits; however, the initial commit that added the files to the local repo still contains the Jupyter Notebook output JSON tags.
- My .gitattributes file currently contains:
*.ipynb filter=strip-notebook-output
- My .git/config file contains:
[filter "strip-notebook-output"]
	clean = "jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to=notebook --stdin --stdout --log-level=ERROR"
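For reference, the effect this clean filter is meant to have on each notebook can be sketched roughly in plain Python. This is only an illustration, assuming the notebooks are nbformat 4 JSON; `strip_outputs` is a hypothetical helper name and not part of nbconvert, and the real ClearOutputPreprocessor may touch a few more fields.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs removed.

    Hypothetical helper for illustration; it approximates what
    ClearOutputPreprocessor does, it is not the nbconvert implementation.
    """
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop stored results, images, tracebacks
            cell["execution_count"] = None  # drop the In[n] counter
    return cleaned


if __name__ == "__main__":
    # Tiny made-up notebook with one executed code cell.
    sample = {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {
                "cell_type": "code",
                "source": ["print('secret')"],
                "metadata": {},
                "execution_count": 3,
                "outputs": [{"output_type": "stream", "name": "stdout", "text": ["secret\n"]}],
            }
        ],
    }
    print(json.dumps(strip_outputs(sample), indent=1))
```

The input dict is left untouched, so the local file can stay as it is while only the cleaned copy is what would be committed.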
Because this has to be done through an API and through Python interactions with the command line, I don't seem to be able to remove the initial commit of the file, which shows the raw file and all its content. That works against the pre-commit hook whenever the file already contains this data when it is added (which is a requirement).
I also use this command to clean Jupyter Notebooks in pre-existing historical commits:
git filter-branch --tree-filter "python3 -m nbconvert --ClearOutputPreprocessor.enabled=True --inplace *.ipynb **/*.ipynb || true"
It does work for other commits that might contain Jupyter Notebook output, but it doesn't seem to work on this initial commit either. I am really at a loss as to how to progress this.
What I need:
- Keep the local .ipynb file as is.
- Commit the .ipynb file to a remote repository without the output data of cells.
- Have a clean git commit history, such that nowhere in the history of commits is this output data stored.
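The third requirement can be checked mechanically. The following is a minimal sketch, assuming `git` is on the PATH and the notebooks are nbformat 4 JSON; the function names (`notebook_has_outputs`, `commits_with_outputs`) are hypothetical and chosen only for this example. It walks every commit reachable from any ref and reports notebooks that still carry cell outputs.

```python
import json
import os
import subprocess


def notebook_has_outputs(raw: bytes) -> bool:
    """True if the raw notebook JSON has at least one code cell with stored outputs."""
    try:
        nb = json.loads(raw)
    except ValueError:  # not valid JSON (or not decodable), so nothing to report
        return False
    if not isinstance(nb, dict):
        return False
    return any(
        isinstance(cell, dict)
        and cell.get("cell_type") == "code"
        and bool(cell.get("outputs"))
        for cell in nb.get("cells", [])
    )


def git(repo: str, *args: str) -> bytes:
    """Run a git command inside `repo` and return its raw stdout."""
    result = subprocess.run(["git", "-C", repo, *args], check=True, capture_output=True)
    return result.stdout


def commits_with_outputs(repo: str = "."):
    """Return (commit, path) pairs for notebooks in history that still have outputs."""
    dirty = []
    for commit in git(repo, "rev-list", "--all").decode().split():
        # -z gives NUL-separated, unquoted paths
        names = git(repo, "ls-tree", "-r", "--name-only", "-z", commit).split(b"\0")
        for name in names:
            if name.endswith(b".ipynb"):
                path = os.fsdecode(name)
                if notebook_has_outputs(git(repo, "show", f"{commit}:{path}")):
                    dirty.append((commit, path))
    return dirty


if __name__ == "__main__":
    for commit, path in commits_with_outputs("."):
        print(commit[:10], path)
```

An empty result would suggest the reachable history is clean. Note that `rev-list --all` also follows the backup refs that `git filter-branch` leaves under `refs/original/`, so the old, uncleaned commits can still show up here until those refs are removed.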
If this is not possible, or I am missing something or doing something wrong, please let me know :)