I'm currently working on a script that is supposed to go through a cloned Git repo and remove any existing symlinks (broken or not) before recreating them for the user to ease with project initialization.
The solved problems:
- Finding symbolic links (broken or unbroken) in Python is very well documented.
- GitHub/GitLab breaking symlinks upon downloading/cloning/updating repositories is also very well documented as is how to fix this problem. Tl;dr: Git will download symlinks within the repo as plain text files (with no extension) containing only the symlink path if certain
config
flags are not set properly.
The unsolved problem:
My problem is that developers may download this repo without realizing the issues with Git, and end up with the symbolic links "checked out as small plain files that contain the link text" which is completely undetectable (as far as I can tell) when parsing the cloned files/directories (via existing base libraries). Running os.stat()
on one of these broken symlink files returns information as though it were a normal text file:
os.stat_result(st_mode=33206, st_ino=14073748835637675, st_dev=2149440935, st_nlink=1, st_uid=0, st_gid=0, st_size=42, st_atime=1671662717, st_mtime=1671662699, st_ctime=1671662699)
The st_mode
information only indicates that it is a normal text file- 100666
(the first 3 digits are the file type and the last 3 are the UNIX-style permissions). It should show up as 120000
.
The os.path.islink()
function only ever returns False
.
THE CONFUSING PART is that when I run git ls-files -s ./microservices/service/symlink_file
it gives 1200000
as the mode bits which, according to the documentation, indicates that this file is a symlink. However I cannot figure out how to see this information from within Python.
I've tried a bunch of things to try and find and delete existing symlinks. Here's the base method that just finds symlink directories and then deletes them:
def clearsymlinks(cwd: str = ""):
"""
Crawls starting from root directory and deletes all symlinks
within the directory tree.
"""
if not cwd:
cwd = os.getcwd()
print(f"Clearing symlinks - starting from {cwd}")
# Create a queue
cwd_dirs: list[str] = [cwd]
while len(cwd_dirs) > 0:
processing_dir: str = cwd_dirs.pop()
# print(f"Processing {processing_dir}") # Only when debugging - else it's too much output spam
for child_dir in os.listdir(processing_dir):
child_dir_path = os.path.join(processing_dir, child_dir)
# check if current item is a directory
if not os.path.isdir(child_dir_path):
if os.path.islink(child_dir_path):
print(f"-- Found symbolic link file {child_dir_path} - removing.\n")
os.remove(child_dir_path)
# skip the dir checking
continue
# Check if the child dir is a symlink
if os.path.islink(child_dir_path):
print(f"-- Found symlink directory {child_dir_path} - removing.")
os.remove(child_dir_path)
else:
# Add the child dir to the queue
cwd_dirs.append(child_dir_path)
After deleting symlinks I run os.symlink(symlink_src, symlink_dst)
and generally run into the following error:
Traceback (most recent call last):
File "C:\Users\me\my_repo\remakesymlinks.py", line 123, in main
os.symlink(symlink_src, symlink_dst)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\me\\my_repo\\SharedPy' -> 'C:\\Users\\me\\my_repo\\microservices\\service\\symlink_file'
A workaround to specifically this error (inside the create symlink method) is:
try:
os.symlink(symlink_src, symlink_dst)
except FileExistsError:
os.remove(symlink_dst)
os.symlink(symlink_src, symlink_dst)
But this is not ideal because it doesn't prevent a huge list of defunct/broken symlinks from piling up in the directory. I should be able to find any symlinks (working, broken, non-existent, etc.) and then delete them.
I have a list of the symlinks that should be created by my script, but extracting the list of targets from this list is also a workaround that also causes a 'symlink-leak'. Below is how I'm currently finding the broken symlink purely for testing purposes.
if not os.path.isdir(child_dir_path):
if os.path.basename(child_dir_path) in [s.symlink_install_to for s in dirs_to_process]:
print(f"-- Found symlink file {child_dir_path} - removing.")
os.remove(child_dir_path)
# skip the dir checking
continue
A rudimentary solution where I filter for only 'text/plain' files with exactly 1 line (since checking anything else is pointless) and trying to determine whether that single line is just a file path (this seems excessive though):
# Check if Git downloaded the symlink as a plain text file (undetectable broken symlink)
if not os.path.isdir(child_dir_path):
try:
if magic.Magic(mime = True, uncompress = True).from_file(child_dir_path) == 'text/plain':
with open(child_dir_path, 'r') as file:
for i, line in enumerate(file):
if i >= 1:
raise StopIteration
else:
# this function is directly copied from https://stackoverflow.com/a/34102855/8705841
if is_path_exists_or_creatable(line):
print(f"-- Found broken Git link file '{child_dir_path}' - removing.")
print(f"\tContents: \"{line}\"")
# os.remove(child_dir_path)
raise StopIteration
except (StopIteration, UnicodeDecodeError, magic.MagicException):
file.close()
continue
Clearly this solution would require a lot of refactoring (9 indents is pretty ridiculous) if it's the only viable option. Problem with this solution (currently) is that it also tries to delete any single-line files with a string that does not break pathname requirements- i.e. import _virtualenv
, some random test string
, project-name
. Filtering out those files with spaces, or files without slashes, could potentially work but this still feels like chasing edge cases instead of solving the actual problem.
I could potentially rewrite this script in Bash wherein I could, in addition to existing symlink search/destroy code, parse the results from git ls-files -s ...
and then delete any files with the 120000
tag. But is this method feasible and/or reliable? There should be a way to do this from within Python since Bash isn't going to run on every OS.
Note: file names have been redacted/changed after copy-paste for privacy, they shouldn't matter anyways since they are generated dynamically by the path searching functions