I have a string that contains one or more filenames and/or file wildcards, e.g. "somefile.txt" or "somefile.txt *.pdf *.txt foo.bar".
I want to turn that into a single iterator over all of the matching files, with no duplicates. E.g., in the second example above, somefile.txt would naturally appear twice: once from the literal filename and once from *.txt. I want it to appear only once in the iterator.
Here's what I've been playing with (most of which is from this SO question), which isn't de-duped. (I'm only printing in the example; there will obviously be real processing in the for loop.)
import itertools as it
from glob import iglob

def glob_everything(filelist):
    return it.chain.from_iterable(iglob(f) for f in filelist)

parmfiles = "somefile.txt *.txt"
files = parmfiles.split()

for file in glob_everything(files):
    print('3', file)
I'm using iglob instead of glob because several thousand files might be involved, and I was trying not to hold all of them in memory at once.
Is it possible to (easily) de-dup the iterator in the glob_everything function above? (I want the iterator and/or list de-duped up front; I don't want to have to track whether I've already seen a filename while processing the list.)
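For reference, one lazy way to do this that I've seen is the `unique_everseen` recipe from the itertools documentation: wrap the chained iterator in a generator that keeps a `seen` set and only yields names it hasn't yielded before. It still keeps the seen names in memory, but yields matches as they arrive. A minimal sketch:

```python
import itertools as it
from glob import iglob

def unique_everseen(iterable):
    """Yield items from iterable, skipping any already yielded."""
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

def glob_everything(filelist):
    # Same chained iglob as before, but wrapped so duplicates are dropped lazily.
    return unique_everseen(it.chain.from_iterable(iglob(f) for f in filelist))
```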
If not, I assume I'll have to glob the specs individually, extending a list each time, and then turn the list into a set (e.g. set(filelist)) to de-dup it.
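That fallback would look something like this sketch (using a set directly instead of building a list first, since the end result is a set either way):

```python
from glob import glob

def glob_everything_set(filelist):
    """Glob each pattern eagerly and collect the matches into a set,
    so duplicate matches across patterns collapse automatically."""
    matched = set()
    for pattern in filelist:
        matched.update(glob(pattern))
    return matched
```

The trade-off is that this materializes all matching names at once, which is exactly what iglob was meant to avoid.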