I have a list of files and directories. I'm trying to write a function to remove entries where there is also an entry for an ancestor directory present. What I have so far seems to work, but I think it is inefficient because it tests the full list of directories for every file.
Maybe there's a library out there to do this, but I can't find it. The purpose is to allow the user to choose a list of files and directories to upload.
As you can see from the example, directories are a subset of entries. I'd prefer to just provide the entries.
import os
def remove_redundant_entries(entries, directories):
result = []
for entry in entries:
# make a copy and successively get the dirname and test it
partial_path = entry
found = False
while partial_path != os.sep:
partial_path = os.path.dirname(partial_path)
if partial_path in directories:
found = True
break
if not found:
result.append(entry)
return result
entries = [
"/home/fred/work/f1.txt",
"/home/fred/work/f2.txt",
"/home/fred/play/f3.txt",
"/home/fred/play",
"/home/jane/dev/f1.txt",
"/home/jane"]
directories = [
"/home/fred/play",
"/home/jane"]
print remove_redundant_entries(entries, directories)
# result:
['/home/fred/work/f1.txt', '/home/fred/work/f2.txt', '/home/fred/play', '/home/jane']
if you know of a library or can give a clue to a better algorithm I'd appreciate it. Meanwhile, I will try something based on sorting the entries, as ancestors should always precede their children in the list.
EDIT: - RESULTS
I ran all solutions 10,000 times through the profiler with the tests set - and with one file added /home/fred/work/f2.txt.bak
to test make sure a regular filename does cause another to be discarded.
My original code: 1060004 function calls in 0.394 seconds
Stephen Rauch's answer - worked first time: 3250004 function calls in 2.089 seconds
carrdelling's answer - which didn't work for similar filenames: 480004 function calls in 0.146 seconds
carrdelling's edited answer - works for all cases: 680004 function calls in 0.231 seconds
Thanks to everyone who contributed!