
I have a list of files and directories. I'm trying to write a function that removes entries when an ancestor directory is also present in the list. What I have so far seems to work, but I think it is inefficient because it tests the full list of directories for every file.

Maybe there's a library out there to do this, but I can't find it. The purpose is to allow the user to choose a list of files and directories to upload.

As you can see from the example, directories are a subset of entries. I'd prefer to just provide the entries.

import os

def remove_redundant_entries(entries, directories):
    result = []
    for entry in entries:
        # make a copy and successively get the dirname and test it
        partial_path = entry
        found = False
        while partial_path != os.sep:
            partial_path = os.path.dirname(partial_path)
            if partial_path in directories:
                found = True
                break
        if not found:
            result.append(entry)
    return result


entries = [
    "/home/fred/work/f1.txt",
    "/home/fred/work/f2.txt",
    "/home/fred/play/f3.txt",
    "/home/fred/play",
    "/home/jane/dev/f1.txt",
    "/home/jane"]

directories = [
    "/home/fred/play",
    "/home/jane"]

print(remove_redundant_entries(entries, directories))

# result:
['/home/fred/work/f1.txt', '/home/fred/work/f2.txt', '/home/fred/play', '/home/jane']

If you know of a library, or can give me a clue to a better algorithm, I'd appreciate it. Meanwhile, I will try something based on sorting the entries, since ancestors should always precede their children in a sorted list.
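As a side note, a minimal sketch of one cheap improvement, assuming absolute POSIX-style paths as in the sample data (the function name is just for illustration): the algorithm above is unchanged, but `directories` becomes a `set`, so each ancestor test is an average O(1) hash lookup rather than a scan of the whole list.

import os

def remove_redundant_entries_set(entries, directories):
    directory_set = set(directories)  # hash lookups instead of list scans
    result = []
    for entry in entries:
        # same walk as above: successively take the dirname and test it
        partial_path = entry
        found = False
        while partial_path != os.sep:
            partial_path = os.path.dirname(partial_path)
            if partial_path in directory_set:
                found = True
                break
        if not found:
            result.append(entry)
    return result

The output matches the original function on the example data; only the cost of each membership test changes.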

EDIT: RESULTS

I ran all solutions 10,000 times through the profiler with the test set above, plus one added file, /home/fred/work/f2.txt.bak, to make sure a regular filename does not cause a similarly named file to be discarded.

My original code: 1060004 function calls in 0.394 seconds

Stephen Rauch's answer - worked first time: 3250004 function calls in 2.089 seconds

carrdelling's answer - which didn't work for similar filenames: 480004 function calls in 0.146 seconds

carrdelling's edited answer - works for all cases: 680004 function calls in 0.231 seconds

Thanks to everyone who contributed!

Julian Mann
  • Would it be simpler to use the `startswith` method available for string objects? That way, for each entry you can decide whether to keep it or not based on whether it begins with any of the directory strings. Not really simpler as an algorithm, but it removes the while loop. – domochevski Mar 25 '18 at 17:01
  • I did try `startswith`, but there could be a case where one regular file starts with the name of another regular file: `/somepath/foo.txt.bak , /somepath/foo.txt` (see the sketch just after these comments). Having said that, it's possible a regex could do it? – Julian Mann Mar 25 '18 at 17:11
  • You don't need a regex - see my updated answer. If you sort your input list properly, you should be able to keep both `/somepath/foo.txt.bak` and `/somepath/foo.txt`, even when using `startswith` – carrdelling Mar 25 '18 at 17:46
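A quick sketch of the pitfall raised in these comments, using the paths from the comment above (not code from the thread itself): a bare `startswith` test matches sibling files that merely share a prefix, while appending `os.sep` to the candidate prefix matches only true descendants.

import os

candidate = "/somepath/foo.txt"
sibling = "/somepath/foo.txt.bak"

print(sibling.startswith(candidate))            # True  - false positive on a sibling
print(sibling.startswith(candidate + os.sep))   # False - only real children match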

2 Answers


If you sort your input list of entries, then the problem is easier:

def remove_redundant_entries(entries):
    # sorting guarantees that an ancestor directory sorts before
    # (and is therefore processed before) anything beneath it
    split_entries = sorted(entries)

    valid_entries = []

    for entry in split_entries:
        # discard the entry if an already-kept path is a prefix of it
        if any(entry.startswith(p) for p in valid_entries):
            continue
        valid_entries.append(entry)

    return valid_entries

Note that `any` short-circuits as soon as one comparison is true, so it will not compare against the whole list unless strictly necessary. Also, since the list comes in sorted, you are guaranteed that the output will have the minimum number of (highest-level) paths.

EDIT:

If you also need the ability to keep in the list multiple files in the same folder (even if some file names are subsets of others), you just need to modify the sorting criteria:

split_entries = sorted(entries, key=lambda x: (x.count(os.sep), -len(x)))

With that key, folders that are higher in the tree come earlier (so you still end up with the minimum number of paths), but within a folder, files with longer names come earlier, so they won't get discarded because of files with shorter (prefix-like) names.
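For illustration, here is the effect of that key on the `foo.txt` example from the comments (a minimal sketch, not part of the original answer):

import os

entries = ["/somepath/foo.txt", "/somepath/foo.txt.bak"]

# plain lexicographic sort puts the shorter name first, so its longer
# sibling would be wrongly discarded by the startswith test
print(sorted(entries))
# ['/somepath/foo.txt', '/somepath/foo.txt.bak']

# depth first, then longer names first: the .bak file is examined before
# the shorter name can shadow it, so both survive
print(sorted(entries, key=lambda x: (x.count(os.sep), -len(x))))
# ['/somepath/foo.txt.bak', '/somepath/foo.txt']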

carrdelling
    Yes, secondary sorting on length solves the issue of similar filenames. I ran my original and all answers through the profiler. I will edit the question with the results. Thanks. – Julian Mann Mar 25 '18 at 18:09

You can use a set to look up the already-present paths more efficiently, like:

Code:

def remove_redundant_entries(entries):
    present = set()
    result = []
    for entry in sorted(entries):
        # split the absolute path into its components
        path = os.path.abspath(entry).split(os.sep)
        # test every ancestor prefix of this path against the kept set
        found = any(
            tuple(path[:i+1]) in present for i in range(len(path)))
        if not found:
            result.append(entry)
            present.add(tuple(path))
    return result
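To make the ancestor test concrete, a minimal worked example (assuming POSIX paths; the values below are not part of the original answer): once `/home/jane` has been kept, any later entry beneath it matches one of its component prefixes.

present = {('', 'home', 'jane')}            # added when "/home/jane" was kept
path = "/home/jane/dev/f1.txt".split("/")   # ['', 'home', 'jane', 'dev', 'f1.txt']
print(any(tuple(path[:i + 1]) in present for i in range(len(path))))   # True -> discarded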

Test Code:

import os

entries = [
    "/home/fred/work/f1.txt",
    "/home/fred/work/f2.txt",
    "/home/fred/play/f3.txt",
    "/home/fred/play",
    "/home/jane/dev/f1.txt",
    "/home/jane"]

result = remove_redundant_entries(entries)
expected = ['/home/fred/work/f1.txt', '/home/fred/work/f2.txt',
            '/home/fred/play', '/home/jane']
assert set(result) == set(expected)
Stephen Rauch
  • Thanks for this! I was about to mark it accepted, then @carrdelling edited his answer in a way that made it work as well, but faster. – Julian Mann Mar 25 '18 at 18:07