Check parallelism of files using their suffixes

Question

Given a directory of files, e.g.:

mydir/
  test1.abc
  set123.abc
  jaja98.abc
  test1.xyz
  set123.xyz
  jaja98.xyz

I need to check that for every .abc file there is an equivalent .xyz file. I could do it like this:

>>> import os
>>> filenames = ['test1.abc', 'set123.abc', 'jaja98.abc', 'test1.xyz', 'set123.xyz', 'jaja98.xyz']
>>> suffixes = ('.abc', '.xyz')
>>> assert all(os.path.splitext(_filename)[0] + suffixes[1] in filenames for _filename in filenames if _filename.endswith(suffixes[0]))

The above code should pass the assertion, while something like this would fail:

>>> filenames = ['test1.abc', 'set123.abc', 'jaja98.abc', 'test1.xyz', 'set123.xyz']
>>> suffixes = ('.abc', '.xyz')
>>> assert all(os.path.splitext(_filename)[0] + suffixes[1] in filenames for _filename in filenames if _filename.endswith(suffixes[0]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

But that's a little too verbose.
Is there a better way to do the same check?

niemmi · Accepted Answer · 2016-10-07T02:36:12.563

2

You could define a helper function that returns the set of filenames, with the extension stripped, that end with a given suffix. Then you can simply check whether the set of .abc files is a subset of the set of .xyz files:

filenames = ['test1.abc', 'set123.abc', 'jaja98.abc', 'test1.xyz', 'set123.xyz', 'jaja98.xyz']
filenames2 = ['test1.abc', 'set123.abc', 'jaja98.abc', 'test1.xyz', 'set123.xyz']
suffixes = ('.abc', '.xyz')

def filter_ext(names, ext):
    return {n[:-len(ext)] for n in names if n.endswith(ext)}

assert filter_ext(filenames, suffixes[0]) <= filter_ext(filenames, suffixes[1])
assert filter_ext(filenames2, suffixes[0]) <= filter_ext(filenames2, suffixes[1]) # fail

This approach is also more efficient: it runs in O(n) time, whereas the original is O(n^2) because each `in` test against a list is a linear search. Of course, if the list is small this doesn't really matter.
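If it helps to know which files lack a partner rather than only that the assertion failed, the same sets can be subtracted. This is a minimal sketch building on the `filter_ext` helper above; `missing_partners` is a hypothetical name introduced here for illustration, not part of the original answer.

```python
def filter_ext(names, ext):
    # Base names (extension stripped) of every file ending in ext.
    return {n[:-len(ext)] for n in names if n.endswith(ext)}

def missing_partners(names, src_ext, dst_ext):
    # Hypothetical helper: base names that have a src_ext file
    # but no matching dst_ext file (set difference).
    return filter_ext(names, src_ext) - filter_ext(names, dst_ext)

filenames2 = ['test1.abc', 'set123.abc', 'jaja98.abc', 'test1.xyz', 'set123.xyz']
print(missing_partners(filenames2, '.abc', '.xyz'))  # {'jaja98'}
```

An empty result means every .abc file has its .xyz counterpart, so `assert not missing_partners(...)` is equivalent to the subset check.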

edited Oct 07 '16 at 02:36

answered Oct 07 '16 at 02:13


You could improve efficiency even more by using sets, which do faster membership checking than a linear search through a list. – martineau Oct 07 '16 at 02:32
@martineau The example above is using sets. Of course you could convert the whole `filenames` to a `set` and do the same thing as in the question. It would still be **O(n)**, although it might be a bit more efficient in practice. – niemmi Oct 07 '16 at 02:40
Sorry, my mistake. Your statement about the time complexity threw me off. It's better than **O(n)**, it's (nearly) **O(1)** — see [_Time complexity of python set operations?_](http://stackoverflow.com/questions/7351459/time-complexity-of-python-set-operations) – martineau Oct 07 '16 at 02:45
@martineau: Yes, checking if a single item is in a `set` is **O(1)**. But you still need to iterate over `filenames`, so the overall complexity is **O(n)**. – niemmi Oct 07 '16 at 02:49
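The variant discussed in these comments, converting `filenames` to a set and otherwise keeping the question's check, might look like the sketch below. The function name `has_all_partners` is an assumption made for illustration; it does not appear in the thread.

```python
import os

def has_all_partners(filenames, src_ext, dst_ext):
    # Hypothetical helper: build the set once (O(n)), so that each
    # membership test below is O(1) on average instead of a list scan.
    names = set(filenames)
    return all(
        os.path.splitext(name)[0] + dst_ext in names
        for name in names
        if name.endswith(src_ext)
    )

complete = ['test1.abc', 'set123.abc', 'jaja98.abc', 'test1.xyz', 'set123.xyz', 'jaja98.xyz']
partial = ['test1.abc', 'set123.abc', 'jaja98.abc', 'test1.xyz', 'set123.xyz']
assert has_all_partners(complete, '.abc', '.xyz')
assert not has_all_partners(partial, '.abc', '.xyz')
```

The loop still visits every name once, which is why the total cost stays O(n) even though each individual lookup is constant time on average.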
