
I have a string that contains one or more filenames and/or file wildcards, e.g. "somefile.txt" or "somefile.txt *.pdf *.txt foo.bar".

I want to turn that into a single iterator that yields all of the matching files, and that iterator should not contain any duplicates. E.g., in the second example above, somefile.txt would naturally appear twice: once from the filename and once from *.txt. I want it to appear only once in the iterator.

Here's what I've been playing with (most of which is from this SO question), which isn't de-duped. (I'm only printing for the example; the real for loop will obviously do some processing.)

import itertools as it
from glob import iglob

def glob_everything(filelist):
    return it.chain.from_iterable(iglob(f) for f in filelist)

parmfiles = "somefile.txt *.txt"
files = parmfiles.split()

for file in glob_everything(files):
    print('3',file)

I'm using iglob instead of glob because there might be several thousand files involved, and I was trying not to take up memory with all of them.

Is it possible to (easily) de-dup the iterator in the glob_everything function above? (I want the iterator and/or list de-duped before I start; I don't want to have to keep track of whether I've already seen a filename as I process the list.)

If not, I assume I'll have to glob the specs individually, extending a list each time, and then turn the list into a set (e.g. set(filelist)) to de-dup it.
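
For reference, that fallback could look roughly like the sketch below. The function name glob_everything_deduped is made up for illustration, and the sort is only there to give a stable order, since a set has none. It holds every match in memory at once, which is exactly what iglob was meant to avoid.

```python
from glob import glob

def glob_everything_deduped(filelist):
    """Glob each spec eagerly and return the unique matches.

    glob_everything_deduped is a hypothetical name, not part of the glob module.
    """
    matches = []
    for spec in filelist:
        # glob() returns a full list, so all matches are held in memory here.
        matches.extend(glob(spec))
    # set() drops exact-string duplicates; sorted() just makes the order stable.
    return sorted(set(matches))
```

Note that this compares paths as plain strings, so "somefile.txt" and "./somefile.txt" would still count as two different entries.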

Zero Piraeus
vr8ce
  • I think you have an xy problem... I don't think you need to deduplicate the iterator... – Grady Player Jan 12 '18 at 22:10
  • Several thousand filenames - what is that, like 150kB? I'd just `set()` them. – Blorgbeard Jan 12 '18 at 22:10
  • Grady, not sure why you think so. In the example, the somefile.txt IS returned twice; I only want it once. Thus, it needs to be de-duped. Blorgbeard, yeah, you're probably right, in this instance. But I'm still curious whether it can be done, for times when there are very large quantities in the iterator. – vr8ce Jan 12 '18 at 22:16
  • Well in order to dedupe a stream of strings, you would have to add them to a set as you see them. Otherwise you can't answer the question "have I seen this string before?". – Blorgbeard Jan 12 '18 at 22:58

0 Answers