
I am running several different unix commands as subprocesses from Python (using Python's subprocess module) that generate files to be used later on in a pipeline. I'd like to know if there is an elegant way to get a list of the files generated by these subprocesses. Currently I am just using something like this:

self.fastQFiles = []
for filename in os.listdir(self.workdir):
    if re.search(r'\.fastq$', filename, re.IGNORECASE):
        self.fastQFiles.append(self.workdir + "/" + filename)

to search all files in a working directory and return only those that match a given extension. If this is the only way, I can probably make my regex more complicated to match all the expected file types, but I'm a little concerned that old files that happen to match will show up in the search too. I suppose I could add a datetime component as well, but that just feels clunky.

Is there a cleaner way to return the names of files generated by a subprocess?

EDIT: After thinking about this some more, the most elegant solution I can think of is to do this by collection subtraction.

from collections import Counter
import os
import subprocess

preCounter = Counter(os.listdir('/directory'))
subprocess.call(processArguments)
postCounter = Counter(os.listdir('/directory'))
newFiles = list(postCounter - preCounter)

If there's a better way to do this, I'm still open to suggestion.
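As suggested in the comments, the same before/after diff can be written with plain sets instead of Counter. A minimal runnable sketch (the temp directory and the `touch` command are stand-ins for the real working directory and pipeline subprocess):

```python
import os
import subprocess
import tempfile

workdir = tempfile.mkdtemp()  # stand-in for the real working directory

before = set(os.listdir(workdir))  # snapshot of names before the command runs
subprocess.call(['touch', os.path.join(workdir, 'reads.fastq')])  # stand-in subprocess
after = set(os.listdir(workdir))   # snapshot of names after it finishes

new_files = sorted(after - before)  # names that exist now but didn't before
```

Like the Counter version, this only sees new *names*, so a file the subprocess overwrites in place will not appear in `new_files`.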

sage88
  • Does the sub process in question list the files it has created in its output? If so you can get it from that. – Gary van der Merwe Mar 27 '15 at 14:07
  • @Gary van der Merwe Unfortunately they don't. – sage88 Mar 27 '15 at 14:12
  • 1
    If you want to make it more maintainable then instead of `Counter` you could use the built-in `set.difference()`: https://docs.python.org/2/library/stdtypes.html#set.difference – Kashyap Mar 27 '15 at 14:53
  • @thekashyap Yeah that'd probably be better than Counter. My concern with this solution is that it won't find overwritten files as new. – sage88 Mar 27 '15 at 15:33

2 Answers


Personally I prefer the simpler expressions. Easier to maintain. If you wanna show off you can do the same thing like this:

self.fastQFiles = [ff for ff in os.listdir(self.workdir) if re.search(r'\.fastq$', ff, re.IGNORECASE)]

OR

self.fastQFiles = list(filter(lambda ff: re.search(r'\.fastq$', ff, re.IGNORECASE), os.listdir(self.workdir)))

OR use the good old glob.glob()

OR you can switch to using apply_async() and make your functions return the names of the file(s) they create. Then you can simply collect such a list without any post-processing.
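For reference, the glob.glob() variant mentioned above might look like the sketch below (the temp directory and file names are just for illustration). One caveat, raised in the comments: glob patterns are case-sensitive on most Unix filesystems, so mimicking re.IGNORECASE takes extra patterns.

```python
import glob
import os
import tempfile

workdir = tempfile.mkdtemp()  # stand-in directory for the sketch
for name in ('a.fastq', 'b.FASTQ', 'notes.txt'):
    open(os.path.join(workdir, name), 'w').close()

# glob is case-sensitive on most Unix filesystems, so cover both common casings
fastq_files = sorted(glob.glob(os.path.join(workdir, '*.fastq')) +
                     glob.glob(os.path.join(workdir, '*.FASTQ')))
```

This matches only the two listed casings, not arbitrary mixes like `.FastQ`, which is why the regex approach is easier to make truly case-insensitive.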

Kashyap
  • This doesn't answer my question at all and instead just gives me back my own answer. I'm also aware Python's syntax can be used like that, but writing a 4-line expression out on one line is rarely more clear to anyone debugging your code. And glob.glob is difficult to use with ignore case, plus again is exactly what I've done already. – sage88 Mar 27 '15 at 14:11
  • @sage88 hence the preface 'I prefer the simpler expressions. Easier to maintain.'. Anyway, I was in process of updating it, still not sure what your definition of 'elegant' is so hard to answer the Q. – Kashyap Mar 27 '15 at 14:16
  • Wouldn't using the apply_async() function require that the subprocess files called return the names? The subprocesses I'm calling are files I didn't write, they're part of existing pipelines and it would be a really bad idea for me to modify them. – sage88 Mar 27 '15 at 14:24
  • I posted a possible solution as an edit to my question. It might give you a better idea of what I'm trying to accomplish. – sage88 Mar 27 '15 at 14:43
  • @sage88, for apply_async you have to 'make your functions return name of file(s)'. I have no other ideas meeting your definition of elegant. :-) – Kashyap Mar 27 '15 at 14:51

Alright, so the solution I came up with uses the DictDiffer class created by @hughdbrown in combination with os.stat(). I used os.stat() to get the st_mtime attribute, which is the time a file was last modified and so can show whether a file has been overwritten between two snapshots. I store everything in dictionaries with the filenames as keys and the st_mtime values as the values.


import os
import subprocess

workdir = '/path/to/directory'
preFileStats = {}
for filename in os.listdir(workdir):
    preFileStats[filename] = os.stat(workdir + "/" + filename).st_mtime

subprocess.call(processArguments)

postFileStats = {}
for filename in os.listdir(workdir):
    postFileStats[filename] = os.stat(workdir + "/" + filename).st_mtime

class DictDiffer(object):
    """
    Calculate the difference between two dictionaries as:
    (1) items added
    (2) items removed
    (3) keys same in both but changed values
    (4) keys same in both and unchanged values
    """
    def __init__(self, current_dict, past_dict):
        self.current_dict, self.past_dict = current_dict, past_dict
        self.set_current, self.set_past = set(current_dict.keys()), set(past_dict.keys())
        self.intersect = self.set_current.intersection(self.set_past)
    def added(self):
        return self.set_current - self.intersect 
    def removed(self):
        return self.set_past - self.intersect 
    def changed(self):
        return set(o for o in self.intersect if self.past_dict[o] != self.current_dict[o])
    def unchanged(self):
        return set(o for o in self.intersect if self.past_dict[o] == self.current_dict[o])

d = DictDiffer(postFileStats, preFileStats)
newFiles = list(d.changed()) + list(d.added())

Of course the DictDiffer class is very powerful and could be used for checking for removed files or unchanged files as well.

sage88