
I'm working on a script that walks a cloned Git repo, removes any existing symlinks (broken or not), and then recreates them, to simplify project initialization for the user.

The solved problems:

  • Finding symbolic links (broken or unbroken) in Python is very well documented.
  • GitHub/GitLab breaking symlinks when downloading, cloning, or updating repositories is also very well documented, as is the fix. TL;DR: if certain config flags are not set properly, Git checks out the repo's symlinks as plain text files (with no extension) containing only the symlink path.
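
As a rough illustration of checking those config flags from Python: the setting usually involved is core.symlinks (an assumption here, since the flag is not named above). The sketch below asks Git for its value with `git config --bool --get`. The function names are hypothetical, and it assumes `git` is on the PATH.

```python
import subprocess
from typing import Optional


def parse_bool_config(stdout: str, returncode: int) -> Optional[bool]:
    """Interpret the result of `git config --bool --get <key>`.

    Returns None when the key is not set (git exits non-zero),
    otherwise True/False based on the normalized output.
    """
    if returncode != 0:
        return None
    return stdout.strip() == "true"


def core_symlinks_enabled(repo_root: str) -> Optional[bool]:
    """Ask Git whether core.symlinks is enabled for the repo at repo_root."""
    result = subprocess.run(
        ["git", "config", "--bool", "--get", "core.symlinks"],
        cwd=repo_root,
        capture_output=True,
        text=True,
    )
    return parse_bool_config(result.stdout, result.returncode)
```

A result of False would suggest that symlinks in this clone were checked out as plain files; None only means the key is not set explicitly.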

The unsolved problem:

My problem is that developers may clone this repo without being aware of this Git behavior and end up with the symbolic links "checked out as small plain files that contain the link text". As far as I can tell, this is completely undetectable when walking the cloned files/directories with the standard library. Running os.stat() on one of these broken symlink files returns information as though it were a normal text file:

os.stat_result(st_mode=33206, st_ino=14073748835637675, st_dev=2149440935, st_nlink=1, st_uid=0, st_gid=0, st_size=42, st_atime=1671662717, st_mtime=1671662699, st_ctime=1671662699)

The st_mode value only indicates a regular file: 33206 is 100666 in octal (the first three digits are the file type and the last three are the UNIX-style permissions). For a symlink it should show up as 120000.
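
To double-check that arithmetic, here is a small sketch using the standard stat module to split the reported st_mode into its file-type and permission bits. The helper name describe_mode is hypothetical.

```python
import stat


def describe_mode(st_mode: int) -> tuple[str, str]:
    """Split st_mode into (file-type bits, permission bits) as octal strings."""
    file_type = stat.S_IFMT(st_mode)     # keeps only the file-type bits
    permissions = stat.S_IMODE(st_mode)  # keeps only the permission bits
    return f"{file_type:06o}", f"{permissions:03o}"


mode = 33206  # st_mode from the os.stat() result above
print(describe_mode(mode))     # ('100000', '666')
print(stat.S_ISREG(mode))      # True: a regular file
print(stat.S_ISLNK(mode))      # False: not a symlink
print(stat.S_ISLNK(0o120000))  # True: what a real symlink would report
```

So the file system really does hold a regular file here. The 120000 mode is recorded only in Git's index, not on disk.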

The os.path.islink() function only ever returns False.

THE CONFUSING PART is that when I run git ls-files -s ./microservices/service/symlink_file, it reports 120000 as the mode bits, which, according to the documentation, indicates that this file is a symlink. However, I cannot figure out how to get this information from within Python.

I've tried a bunch of things to find and delete existing symlinks. Here's the base method, which finds symlinks (files and directories) and deletes them:

import os


def clearsymlinks(cwd: str = ""):
    """
    Crawls starting from root directory and deletes all symlinks
    within the directory tree.
    """
    if not cwd:
        cwd = os.getcwd()
    
    print(f"Clearing symlinks - starting from {cwd}")
    # Create a queue
    cwd_dirs: list[str] = [cwd]
    while len(cwd_dirs) > 0:
        processing_dir: str = cwd_dirs.pop()
        # print(f"Processing {processing_dir}")  # Only when debugging - else it's too much output spam
        for child_dir in os.listdir(processing_dir):
            child_dir_path = os.path.join(processing_dir, child_dir)
            
            # check if current item is a directory
            if not os.path.isdir(child_dir_path):
                if os.path.islink(child_dir_path):
                    print(f"-- Found symbolic link file {child_dir_path} - removing.\n")
                    os.remove(child_dir_path)
                
                # skip the dir checking
                continue
            
            # Check if the child dir is a symlink
            if os.path.islink(child_dir_path):
                print(f"-- Found symlink directory {child_dir_path} - removing.")
                os.remove(child_dir_path)
            else:
                # Add the child dir to the queue
                cwd_dirs.append(child_dir_path)

After deleting symlinks I run os.symlink(symlink_src, symlink_dst) and generally run into the following error:

Traceback (most recent call last):
  File "C:\Users\me\my_repo\remakesymlinks.py", line 123, in main
    os.symlink(symlink_src, symlink_dst)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\me\\my_repo\\SharedPy' -> 'C:\\Users\\me\\my_repo\\microservices\\service\\symlink_file'

A workaround for this specific error (inside the create-symlink method) is:

try:
    os.symlink(symlink_src, symlink_dst)
except FileExistsError:
    os.remove(symlink_dst)
    os.symlink(symlink_src, symlink_dst)

But this is not ideal because it doesn't prevent a huge list of defunct/broken symlinks from piling up in the directory. I should be able to find any symlinks (working, broken, non-existent, etc.) and then delete them.

I have a list of the symlinks that my script should create, but deriving the files to delete from that list is also just a workaround, and it still causes a 'symlink leak'. Below is how I'm currently finding the broken symlinks, purely for testing purposes:

if not os.path.isdir(child_dir_path):
    if os.path.basename(child_dir_path) in [s.symlink_install_to for s in dirs_to_process]:
        print(f"-- Found symlink file {child_dir_path} - removing.")
        os.remove(child_dir_path)
    
    # skip the dir checking
    continue

A rudimentary solution is to filter for 'text/plain' files with exactly one line (since checking anything else is pointless) and then try to determine whether that single line is just a file path (this seems excessive, though):

# Check if Git downloaded the symlink as a plain text file (undetectable broken symlink)
if not os.path.isdir(child_dir_path):
    try:
        if magic.Magic(mime = True, uncompress = True).from_file(child_dir_path) == 'text/plain':
            with open(child_dir_path, 'r') as file:
                for i, line in enumerate(file):
                    if i >= 1:
                        raise StopIteration
                else:
                    # this function is directly copied from https://stackoverflow.com/a/34102855/8705841
                    if is_path_exists_or_creatable(line):
                        print(f"-- Found broken Git link file '{child_dir_path}' - removing.")
                        print(f"\tContents: \"{line}\"")
                        # os.remove(child_dir_path)
                        raise StopIteration
    except (StopIteration, UnicodeDecodeError, magic.MagicException):
        # no explicit close needed: the `with` block has already closed the file
        continue

Clearly this solution would require a lot of refactoring (9 indents is pretty ridiculous) if it's the only viable option. The problem with this solution (currently) is that it also tries to delete any single-line file whose content does not violate pathname requirements, e.g. import _virtualenv, some random test string, project-name. Filtering out files with spaces, or files without slashes, could potentially work, but this still feels like chasing edge cases instead of solving the actual problem.

I could potentially rewrite this script in Bash, where I could, in addition to the existing symlink search/destroy code, parse the output of git ls-files -s ... and delete any files with the 120000 mode. But is this method feasible and/or reliable? There should be a way to do this from within Python, since Bash isn't going to run on every OS.


Note: file names have been redacted/changed after copy-paste for privacy; they shouldn't matter anyway, since they are generated dynamically by the path-searching functions.

elkshadow5
  • They show up as normal files when checked out because they *are* normal files. That is, Git was told "this OS / file system does not support symlinks". Git then replied with: "OK, I will find any symlinks, things mode 120000, and instead of attempting to *create* symlinks, which don't work, I'll create an ordinary file so that you can edit the file and update the symlink contents if you like, while keeping the `mode 120000` in the index so I know to commit it as a symlink later." – torek Dec 23 '22 at 01:33
  • If the OS does in fact support symlinks, you should just tell Git that it does so. However, Git *should* be finding this out correctly on its own: if that isn't working, that's a bug in the Git-for-Windows version you're using. – torek Dec 23 '22 at 01:33
  • Note that if Git is installed (which it must be to have done a `git switch` or `git checkout` here), you can run `git ls-files` from Python using `subprocess.Popen` and the rest of the subprocess family of stuff. – torek Dec 23 '22 at 01:34
  • (Well, let me amend one comment slightly: Git finds `chmod` support on its own, but might need some symlink support hints on Windows. I don't know for sure as I avoid Windows. I'd think in 2022 it should be automatic by now...) – torek Dec 23 '22 at 01:40
  • You have to check a specific checkbox (that's unselected by default) telling Git for Windows to check for symlink support and use them if possible. – elkshadow5 Dec 26 '22 at 20:13
  • @torek so then there's not really a way to solve this problem, except via running `git ls-files` and parsing the output. That's disappointing but thanks for letting me know – elkshadow5 Dec 26 '22 at 20:14
  • That's what I'd do, run `subprocess.Popen` on `["git", "ls-files", "-sz"]` and parse the resulting easily-machine-interpreted binary data. – torek Dec 27 '22 at 09:18
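
Following the suggestion in the comments, here is a minimal sketch of reading the index from Python. It runs `git ls-files -sz`, whose entries are NUL-terminated and have the form `<mode> <object> <stage>\t<path>`, and keeps only entries with mode 120000. The function names are hypothetical. The sketch assumes `git` is on the PATH and that repo_root is the top of the working tree. It uses subprocess.run rather than Popen purely for brevity.

```python
import os
import subprocess

SYMLINK_MODE = b"120000"  # mode Git records in the index for symlinks


def parse_symlink_paths(raw: bytes) -> list[str]:
    """Return the paths of mode-120000 entries in `git ls-files -sz` output."""
    paths = []
    for entry in raw.split(b"\0"):
        if not entry:
            continue  # skip the empty chunk after the final NUL
        meta, _, path = entry.partition(b"\t")
        mode = meta.split(b" ", 1)[0]
        if mode == SYMLINK_MODE:
            paths.append(os.fsdecode(path))
    return paths


def list_git_symlinks(repo_root: str) -> list[str]:
    """List every path that Git's index records as a symlink."""
    result = subprocess.run(
        ["git", "ls-files", "-sz"],
        cwd=repo_root,
        capture_output=True,
        check=True,
    )
    return [
        os.path.normpath(os.path.join(repo_root, p))
        for p in parse_symlink_paths(result.stdout)
    ]


def remove_git_symlinks(repo_root: str) -> None:
    """Delete real symlinks and the plain-file stand-ins Git checked out."""
    for path in list_git_symlinks(repo_root):
        if os.path.lexists(path):  # also True for broken symlinks
            os.remove(path)
```

Because the list comes from the index rather than the file system, it covers both real symlinks and the plain-text stand-ins. It does not cover symlinks that were created locally and never added to Git.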
