
I am trying to use glob.iglob to search for XML files in all subfolders of a specific folder. The problem is that some folders are linked into each other, so the search ends up in a never-ending subfolder path. For example:

First Level\
 Second Level A\
  Third Level: Link to Second Level B\
  Third Level: subfolder with xml files\
 Second Level B\
  Third Level: Link to Second Level A\
  Third Level: subfolder with xml files\

So I need to exclude some subfolders by name. Is there a way to do that? I already tried passing a list like:

glob.iglob([r'/**/*.xml', r'!/Link to Second Level B/'])

But this did not work for me.

Does anyone have an idea how to solve this?

Thanks for your help!

CristiFati
Meret

1 Answer


I want to start by pointing out that this kind of (circular) symlinking is a sign of bad design. Any fix would address the effect of the problem, not its cause ("sweeping the dirt under the carpet").

Unfortunately, the recursive glob doesn't allow filtering, nor does it provide access to the elements while enumerating them. So you need another way: enumerate the dir elements yourself (using one of the many existing ways; take a look at [SO]: How do I check whether a file exists without exceptions? (@CristiFati's answer)) and filter out the unwanted ones.
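As one possible way of enumerating the elements yourself, here is a minimal sketch based on os.walk (it is not part of the code further down). It assumes that excluding dirs by name is enough: os.walk does not descend into symlinked dirs by default (followlinks=False), and removing names from the dir list in place prevents it from entering them. The names find_files and excluded_dir_names are hypothetical, chosen only for illustration.

```python
import os


def find_files(root, ext, excluded_dir_names=()):
    """Yield paths under root that end in .ext, skipping dirs by name."""
    suffix = "." + ext
    # followlinks defaults to False, so symlinked dirs are never entered.
    for dir_path, dir_names, file_names in os.walk(root):
        # Editing dir_names in place tells os.walk not to descend into them.
        dir_names[:] = [name for name in dir_names if name not in excluded_dir_names]
        for file_name in file_names:
            if file_name.endswith(suffix):
                yield os.path.join(dir_path, file_name)
```

It could be called like list(find_files("search_dir", "xml", excluded_dir_names={"Link to Second Level B"})), where the excluded name is taken from the question's example.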

Here's the test dir structure. Note that when this listing was taken, the 2 circular symlinks were actually normal dirs, otherwise they would have messed up the tree command (which doesn't handle this case either). I replaced them with symlinks afterwards:

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q057591233]> tree /a /f .
Folder PATH listing for volume Work
Volume serial number is 3655-6FED
E:\WORK\DEV\STACKOVERFLOW\Q057591233
|   code00.py
|
+---external_dir
|       file00.xml
|
\---search_dir
    |   file00.xml
    |   file01.xml
    |
    +---dir00
    |   +---dir00
    |   |   |   file00.xml
    |   |   |
    |   |   \---dir00
    |   |           file00.xml
    |   |
    |   \---dir01_symlink_to_parent_sibbling_dir01
    \---dir01
        +---dir00_symlink_to_parent_sibbling_dir00
        +---dir01
        |       file00.xml
        |
        \---dir02_symlink_to_external_dir
                file00_ext.xml

code00.py:

#!/usr/bin/env python3

import sys
import os
import re
import pprint


def _get_files_os_scandir_no_symlikns(dir_name, match_func, level=0):
    for item in os.scandir(dir_name):
        if item.is_symlink():
            continue
        if item.is_dir():
            yield from _get_files_os_scandir_no_symlikns(item.path, match_func, level=level + 1)
        elif match_func(item.path):
            yield item.path


def _get_files_os_scandir(dir_name, match_func, visited_inodes, level=0):
    for item in os.scandir(dir_name):
        if item.inode() in visited_inodes:
            continue
        visited_inodes.append(item.inode())
        item_path = os.path.normpath(os.path.join(*os.path.split(item.path)[:-1], os.readlink(item.path))) if item.is_symlink() else item.path
        if item.is_dir():
            yield from _get_files_os_scandir(item_path, match_func, visited_inodes, level=level + 1)
        elif match_func(item_path):
            yield item_path


def get_files(path, ext, exclude_symlinks=True):
    if exclude_symlinks and os.path.islink(path):
        return
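    # Raw string avoids the invalid "\." escape warning; re.escape guards against regex metacharacters in ext
    pattern = re.compile(r".*\.{0:s}$".format(re.escape(ext)))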
    if os.path.isdir(path):
        if exclude_symlinks:
            yield from _get_files_os_scandir_no_symlikns(path, pattern.match)
        else:
            yield from _get_files_os_scandir(path, pattern.match, list())
    elif os.path.isfile(path) and pattern.match(path):
        yield path


def main():
    search_dir = "search_dir"
    extension = "xml"
    for exclude_symlinks in [True, False]:
        print("\nExclude symlinks: {0:}".format(exclude_symlinks))
        files = list(get_files(search_dir, extension, exclude_symlinks=exclude_symlinks))
        pprint.pprint(files)
        print("Total items: {0:d}".format(len(files)))


if __name__ == "__main__":
    print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    main()
    print("\nDone.")

Output:

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q057591233]> "e:\Work\Dev\VEnvs\py_064_03.07.03_test0\Scripts\python.exe" code00.py
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] 64bit on win32


Exclude symlinks: True
['search_dir\\dir00\\dir00\\dir00\\file00.xml',
 'search_dir\\dir00\\dir00\\file00.xml',
 'search_dir\\dir01\\dir01\\file00.xml',
 'search_dir\\file00.xml',
 'search_dir\\file01.xml']
Total items: 5

Exclude symlinks: False
['search_dir\\dir00\\dir00\\dir00\\file00.xml',
 'search_dir\\dir00\\dir00\\file00.xml',
 'search_dir\\dir01\\dir01\\file00.xml',
 'external_dir\\file00_ext.xml',
 'search_dir\\file00.xml',
 'search_dir\\file01.xml']
Total items: 6

Done.

Notes:

  • The recursive implementation relies on [Python 3.Docs]: os.scandir(path='.') (and other file / dir functions)
  • For file name matching there's no wildcard support in this approach, so the closest (?) thing (a regexp) is used
  • The 2 functions traversing the dir:
    • _get_files_os_scandir_no_symlikns - ignores all symlinks
    • _get_files_os_scandir - includes symlinks. It also keeps track of the visited inodes to avoid infinite recursion, and resolves symlinks to their targets
  • The 2 functions could have been unified (with an extra argument (exclude_symlinks)), but I have a feeling that the one ignoring symlinks performs much faster this way
  • As seen, neither enters infinite recursion (for the former it's obvious), but the former also omits the file external to the search dir
  • get_files - a wrapper that calls one of the 2, after doing some initialization work (to avoid repeating it in each recursive call)
  • I only ran the whole code on Win, but I ran parts of it on Nix as well, so I'm not expecting any surprises there
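Regarding the notes on wildcards and on infinite recursion, here is an alternative sketch (not the code above): fnmatch gives shell-style wildcard matching on file names, and a set of already visited real paths stops os.walk(followlinks=True) from looping. The function name walk_following_links is hypothetical. It assumes that os.path.realpath resolves symlinks on the platform in use (on Win that requires Python 3.8+).

```python
import fnmatch
import os


def walk_following_links(root, pattern="*.xml"):
    """Yield files matching a wildcard pattern, following symlinks without looping."""
    visited = set()  # Real paths of the dirs already scanned.
    for dir_path, dir_names, file_names in os.walk(root, followlinks=True):
        real_path = os.path.realpath(dir_path)
        if real_path in visited:
            # Already scanned via another route: don't descend or report again.
            dir_names[:] = []
            continue
        visited.add(real_path)
        for file_name in file_names:
            if fnmatch.fnmatch(file_name, pattern):
                yield os.path.join(dir_path, file_name)
```

Each file is reported once, under whichever route reaches its dir first, so a path may go through a symlink rather than through the "real" dir.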
CristiFati
  • Unfortunately I can't change this symlinking. Meanwhile I changed my program to search only in subfolders without symlinking but now I will try your solution. So thanks a lot for your work! – Meret Sep 02 '19 at 06:45
  • No one said to change the *symlink*s. This is why I posted the code :). Searching each dir is OK too, but if there are many subdirs, and the *symlink*s are somewhere lower in the tree, it won't work. – CristiFati Sep 02 '19 at 06:55