25

My problem is to find the common path prefix of a given set of files.

I expected "os.path.commonprefix" to do exactly that. Unfortunately, the fact that commonprefix lives in os.path is rather misleading, since it actually compares the strings character by character.

My question is: how can this be solved for paths? The issue was briefly mentioned in this (fairly highly rated) answer, but only as a side note, and the proposed solution (appending slashes to the input of commonprefix) has issues in my opinion, since it fails, for instance, for:

os.path.commonprefix(['/usr/var1/log/', '/usr/var2/log/'])
# returns /usr/var but it should be /usr

To keep others from falling into the same trap, it might be worthwhile to discuss this in a separate question: is there a simple, portable solution that does not rely on nasty file system checks (i.e., taking the result of commonprefix, checking whether it is a directory, and returning its os.path.dirname if it is not)?
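
A minimal sketch of the pitfall, using the example paths from above: the returned prefix ends in the middle of a path component, so none of the inputs actually lies below it.

```python
import os.path

paths = ['/usr/var1/log/', '/usr/var2/log/']

# commonprefix compares character by character, not component by component.
prefix = os.path.commonprefix(paths)
print(prefix)  # /usr/var

# The result is not a common directory: no input path starts with '/usr/var/'.
print(all(p.startswith(prefix + '/') for p in paths))  # False
```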

bluenote10

5 Answers

23

This has been addressed in recent versions of Python. New in version 3.5 is the function os.path.commonpath(), which returns the longest common sub-path instead of the common string prefix. os.path.commonprefix() itself is unchanged.
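
A short sketch of how it behaves, assuming Python 3.5+ on a POSIX system (on Windows the results use backslashes). Note that it raises ValueError when absolute and relative paths are mixed.

```python
import os.path

# Results shown are for a POSIX system.
print(os.path.commonpath(['/usr/var1/log/', '/usr/var2/log/']))  # /usr
print(os.path.commonpath(['/usr/var/log', '/usr/var/log2']))     # /usr/var

# Mixing absolute and relative paths is rejected rather than guessed at.
try:
    os.path.commonpath(['/usr/lib', 'usr/lib'])
except ValueError as exc:
    print('ValueError:', exc)
```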

Dan Getz
cjac
  • 2
    Thanks! This is Yet Another Reason to switch to python 3 -- glad I finally did it last year. – Owen Jun 13 '16 at 21:49
16

A while ago I ran into this: os.path.commonprefix computes a string prefix, not a path prefix as one would expect. So I wrote the following:

def commonprefix(l):
    # Unlike os.path.commonprefix, this always returns a path prefix,
    # because it compares the paths component by component.
    cp = []
    ls = [p.split('/') for p in l]
    ml = min(len(p) for p in ls)

    for i in range(ml):
        s = set(p[i] for p in ls)
        if len(s) != 1:
            break
        cp.append(s.pop())

    return '/'.join(cp)

It could be made more portable by replacing '/' with os.path.sep.
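
A sketch of that change, under the assumption that every input path uses the same separator. The function name commonprefix_by_component and the sep parameter are hypothetical, added here so the behaviour can be tried with either separator on any platform.

```python
import os

def commonprefix_by_component(paths, sep=os.path.sep):
    """Component-wise common prefix; sep defaults to the local separator."""
    split_paths = [p.split(sep) for p in paths]
    common = []
    # zip stops at the shortest path, so no explicit length check is needed.
    for parts in zip(*split_paths):
        if len(set(parts)) != 1:
            break
        common.append(parts[0])
    return sep.join(common)

print(commonprefix_by_component(['/usr/var1/log', '/usr/var2/log'], sep='/'))  # /usr
print(commonprefix_by_component([r'C:\a\b', r'C:\a\c'], sep='\\'))             # C:\a
```

As with the function above, two absolute paths that share only the root (for example '/a' and '/b') yield an empty string rather than '/', and paths are not normalized first.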

Dan D.
  • 1
    I accepted this answer since it is fairly robust while remaining brief. It is interesting to see that, in contrast to a solution based on `os.path.commonprefix` (see Dan Getz's answer), the component-wise comparison does not depend on whether the input paths are file or directory names (simply because filenames are unique components). For a more robust approach I recommend EOL's answer, and to learn more about the problems of `os.path.commonprefix`, Dan Getz's answer is very instructive. Thank you all! – bluenote10 Feb 03 '14 at 11:49
  • 3
    This gives a reasonable answer with both file or directory names, but in the general case you can't know for certain if the result of the function is a file or directory name without doing more work (such as by controlling the input). Also, I believe this returns the exact same result as `os.path.dirname(os.path.commonprefix([p + '/' for p in l]))`? – Dan Getz Feb 03 '14 at 15:14
  • @DanGetz It apparently does not do the same. I just tried it on windows by replacing it with `os.path.sep` and the code provided by @DanD results in the correct common path while your snippet returns `None` – thatsIch Jan 26 '17 at 21:57
  • @thatsIch what inputs cause that? I was unaware `dirname` *could* return `None`. – Dan Getz Jan 26 '17 at 22:04
  • @DanGetz I must have messed it up somewhere. I just tried it again in a fresh environment and it works. My bad. – thatsIch Jan 28 '17 at 13:09
7

Assuming you want the common directory path, one way is to:

  1. Use only directory paths as input. If your input value is a file name, call os.path.dirname(filename) to get its directory path.
  2. "Normalize" all the paths so that they are relative to the same thing and don't include double separators. The easiest way to do this is by calling os.path.abspath() to get the path relative to the root. (You might also want to use os.path.realpath() to remove symbolic links.)
  3. Add a final separator (found portably with os.path.sep or os.sep) to the end of all the normalized directory paths.
  4. Call os.path.dirname() on the result of os.path.commonprefix().

In code (without removing symbolic links):

import os

def common_path(directories):
    norm_paths = [os.path.abspath(p) + os.path.sep for p in directories]
    return os.path.dirname(os.path.commonprefix(norm_paths))

def common_path_of_filenames(filenames):
    return common_path([os.path.dirname(f) for f in filenames])
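
If symbolic links should be resolved as well, one possible variant (a sketch; the name common_real_path is hypothetical) swaps os.path.abspath for os.path.realpath, which also returns an absolute, normalized path:

```python
import os

def common_real_path(directories):
    # realpath makes each path absolute and resolves symbolic links,
    # so two different routes to the same directory compare equal.
    real_paths = [os.path.realpath(p) + os.path.sep for p in directories]
    return os.path.dirname(os.path.commonprefix(real_paths))

# Relative inputs are resolved against the current working directory.
print(common_real_path(['.', os.path.join('.', 'sub')]))
```

Unlike the abspath version, this one does touch the file system to look up links, so it trades the "no file system access" property for canonical results.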
Dan Getz
  • When I was thinking about this idea I rejected it initially, because I thought appending a slash is a problem when working with relative paths, like a blank file name in the current directory. However, when wrapping all paths into `os.path.abspath`, even a mixture of relative and absolute paths should be no problem, right? – bluenote10 Feb 01 '14 at 15:09
  • @bluenote10 Right, `abspath` instead of `normpath` will handle the relative paths. Not sure about blank file names, because that's ambiguous with duplicated path separators. Does anyone allow zero-length file names? – Dan Getz Feb 01 '14 at 15:18
  • Oh, I didn't mean empty file names, just `"aSimpleFileName"`. – bluenote10 Feb 01 '14 at 15:20
  • One should probably mention that it all depends on whether `paths` are _file_ paths or _directory_ paths (this must be a convention in case we want to avoid file system access). In my problem `paths` would be in fact files not directories. I think in this case the proper order is: (a) convert to abspath to deal with mixture of relative/absolute paths; (b) apply dirname to convert to a proper directory. From that point I think we can safely apply your solution. – bluenote10 Feb 01 '14 at 15:43
  • Oh, now I see it, good point. You need to know in advance if your string is a file or directory path. I thought I was getting around that, but instead I was just assuming directories. – Dan Getz Feb 02 '14 at 01:36
  • I think it is important to mention that one cannot simply take the dirname on the input. I tried to edit the question to make that clear. Feel free to revert if I screwed things up :). – bluenote10 Feb 02 '14 at 12:15
  • Hey, looks like someone already reverted it. I looked at what you wrote, and tried to rewrite my answer to make it clearer about how you need to be careful to not use file paths (that is, to get the directory path first). – Dan Getz Feb 02 '14 at 18:42
  • In my edit I tried to explain why what you suggested does not work for files. I think it is not possible to call dirname on the input to `common_path`. This would discard information on relative paths, because `abspath(dirname("somefile.txt")) != dirname(abspath("somefile.txt"))`. Imho it is necessary to have a separate version of `common_path`, which internally does _first_ abspath _then_ dirname (the other way around compared to applying dirname to the argument). Don't know why my edit was discarded :(. – bluenote10 Feb 03 '14 at 11:38
  • For a relative path, I see no problem. Are you talking about file links? Do you have an example where it really is true that `abspath(dirname(x)) != dirname(abspath(x))`? They're equal for the example you gave. – Dan Getz Feb 03 '14 at 15:03
  • You're right! I was wrongly assuming that abspath of an empty string would be evaluated to "the absolute path that does not even contain the root slash", i.e., another empty string. I'm glad I finally see the cause of my confusion. Thanks for making that clear! – bluenote10 Feb 03 '14 at 15:39
2

A robust approach is to split the path into individual components and then find the longest common prefix of the component lists.

Here is an implementation which is cross-platform and can be generalized easily to more than two paths:

import os.path
import itertools

def components(path):
    '''
    Returns the individual components of the given file path
    string (for the local operating system).

    The returned components, when joined with os.path.join(), point to
    the same location as the original path.
    '''
    components = []
    # The loop guarantees that the returned components can be
    # os.path.joined with the path separator and point to the same
    # location:    
    while True:
        (new_path, tail) = os.path.split(path)  # Works on any platform
        components.append(tail)        
        if new_path == path:  # Root (including drive, on Windows) reached
            break
        path = new_path
    components.append(new_path)

    components.reverse()  # First component first 
    return components

def longest_prefix(iter0, iter1):
    '''
    Returns the longest common prefix of the given two iterables.
    '''
    longest_prefix = []
    # zip works on Python 3 (itertools.izip exists only on Python 2)
    for (elmt0, elmt1) in zip(iter0, iter1):
        if elmt0 != elmt1:
            break
        longest_prefix.append(elmt0)
    return longest_prefix

def common_prefix_path(path0, path1):
    return os.path.join(*longest_prefix(components(path0), components(path1)))

# For Unix:
assert common_prefix_path('/', '/usr') == '/'
assert common_prefix_path('/usr/var1/log/', '/usr/var2/log/') == '/usr'
assert common_prefix_path('/usr/var/log1/', '/usr/var/log2/') == '/usr/var'
assert common_prefix_path('/usr/var/log', '/usr/var/log2') == '/usr/var'
assert common_prefix_path('/usr/var/log', '/usr/var/log') == '/usr/var/log'
# Only for Windows:
# assert common_prefix_path(r'C:\Programs\Me', r'C:\Programs') == r'C:\Programs'
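
One possible generalization to any number of paths, sketched here with pathlib's parts attribute standing in for the components() function above (a swapped-in technique, not the code of this answer). The names common_prefix_path_many and path_class are hypothetical; the pure path classes are used so that both flavours can be tried on any platform.

```python
from pathlib import PurePosixPath, PureWindowsPath

def common_prefix_path_many(paths, path_class=PurePosixPath):
    """Longest common leading path of any number of paths ('' if none)."""
    part_lists = [path_class(p).parts for p in paths]
    common = []
    # zip(*...) walks all component tuples in parallel, stopping at the shortest.
    for parts in zip(*part_lists):
        if len(set(parts)) != 1:
            break
        common.append(parts[0])
    return str(path_class(*common)) if common else ''

print(common_prefix_path_many(['/usr/var1/log/', '/usr/var2/log/', '/usr/lib']))  # /usr
print(common_prefix_path_many(['/', '/usr']))                                     # /
print(common_prefix_path_many([r'C:\Programs\Me', r'C:\Programs'],
                              path_class=PureWindowsPath))                        # C:\Programs
```

Note that pathlib drops trailing separators and '.' components when splitting, which differs slightly from the os.path.split loop above.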
Eric O. Lebigot
2

I've made a small Python package, commonpath, to find common paths from a list. It comes with a few nice options.

https://github.com/faph/Common-Path

faph