
I want to generate a wildcard string from a pair of file names. Kind of an inverse-glob. Example:

file1 = 'some foo file.txt'
file2 = 'some bar file.txt'
assert 'some * file.txt' == inverse_glob(file1, file2)

Use difflib perhaps? Has this been solved already?

The application is a large set of data files with similar names. I want to compare each pair of file names, then present a comparison of the pairs of files whose names are "similar". I figure that if I can do an inverse-glob on each pair, the pairs with "good" wildcards (e.g. neither lots*of*stars*.txt nor a bare *) are good candidates for comparison. So I might take the output of this putative inverse_glob() and reject any wildcard that has more than one * or for which glob() doesn't return exactly two files.
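The rejection rule described above could be sketched roughly like this. The helper name `is_good_wildcard` and its `max_stars` parameter are hypothetical, and the sketch assumes the candidate names are already in a list, so it matches with `fnmatch.fnmatchcase` instead of touching the filesystem with glob():

```python
from fnmatch import fnmatchcase

def is_good_wildcard(pattern, all_names, max_stars=1):
    """Hypothetical filter: keep only wildcards that single out one pair."""
    # Reject patterns with too many stars, e.g. 'lots*of*stars*.txt'.
    if pattern.count('*') > max_stars:
        return False
    # Reject the bare star, which says nothing about the names.
    if pattern == '*':
        return False
    # Accept only if exactly two of the known names match.
    matched = [name for name in all_names if fnmatchcase(name, pattern)]
    return len(matched) == 2

all_names = ['some foo file.txt', 'some bar file.txt', 'other.txt']
print(is_good_wildcard('some * file.txt', all_names))  # True
print(is_good_wildcard('*.txt', all_names))            # False: matches three names
```

Matching against a list keeps the check independent of the current directory; swapping in `glob.glob(pattern)` would apply the same rule to real files.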

Bob Stein
  • The general solution to this problem is probably not very easily found. You talk about the filenames being similar. It is most likely possible to find a simpler solution, taking advantage of this. Perhaps you can give some examples of typical filenames? – JohanL May 05 '17 at 18:26
  • @JohanL thanks for thinking outside the box. I'd like to focus this article on the glob-inverse strategy so the question is more useful for posterity. I did simplify the question to 2 files, which you're right is much less general and much simpler. To answer, my files in the wild differ in different ways in different places, e.g. "filename.txt" and "filename2.txt" or "the 24MHz run new.sr" and "the 16MHz run old.sr" – Bob Stein May 05 '17 at 18:45

2 Answers


For instance:

Filenames:

names = [('some foo file.txt','some bar file.txt', 'some * file.txt'),
         ("filename.txt", "filename2.txt", "filenam*.txt"),
         ("1filename.txt", "filename2.txt", "*.txt"),
         ("inverse_glob", "inverse_glob2", "inverse_glo*"),
         ("the 24MHz run new.sr", "the 16MHz run old.sr", "the *MHz run *.sr")]

def inverse_glob(...):

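    import re

    def inverse_glob(f1, f2, force_single_asterisk=None):
        def adjust_name(pp, diff):
            # Pad the shorter name to the length of the longer one: drop up
            # to `diff` trailing characters of the stem and fill with '?'.
            keep = pp[0][:-diff]
            pad = '?' * (len(pp[0]) + diff - len(keep))
            if len(pp) == 2:
                return keep + pad + '.' + pp[1]
            else:
                return keep + pad

        # Split at the last dot only, so names with several dots keep
        # their extension.
        l1 = len(f1); l2 = len(f2)
        if l1 > l2:
            f2 = adjust_name(f2.rsplit('.', 1), l1-l2)
        elif l2 > l1:
            f1 = adjust_name(f1.rsplit('.', 1), l2-l1)

        # Keep characters that agree position by position; mark the rest '?'.
        result = ['?' for n in range(len(f1))]
        for i, c in enumerate(f1):
            if c == f2[i]:
                result[i] = c

        result = ''.join(result)
        # Collapse each run of two or more '?' into one '*'; an isolated
        # '?' stays as a single-character wildcard.
        result = re.sub(r'\?{2,}', '*', result)
        if force_single_asterisk:
            # Merge everything from the first '*' to the last into one '*'.
            result = re.sub(r'\*.+\*', '*', result)
        return result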

Usage:

for name in names:
    result = inverse_glob(name[0], name[1])
    print('{:20} <=> {:20} = {}'.format(name[0], name[1], result))
    assert name[2] == result

Output:

some foo file.txt    <=> some bar file.txt    = some * file.txt  
filename.txt         <=> filename2.txt        = filenam*.txt  
1filename.txt        <=> filename2.txt        = *.txt  
inverse_glob         <=> inverse_glob2        = inverse_glo*
the 24MHz run new.sr <=> the 16MHz run old.sr = the *MHz run *.sr

Tested with Python 3.4.2
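One way to sanity-check the table of names is to confirm that each expected wildcard really matches both of its file names. The sketch below is an addition, not part of the tested answer: `matches_both` is a hypothetical helper, and it relies on the standard-library `fnmatch.fnmatchcase` so that no files need to exist on disk.

```python
from fnmatch import fnmatchcase

names = [('some foo file.txt', 'some bar file.txt', 'some * file.txt'),
         ('filename.txt', 'filename2.txt', 'filenam*.txt'),
         ('1filename.txt', 'filename2.txt', '*.txt'),
         ('inverse_glob', 'inverse_glob2', 'inverse_glo*'),
         ('the 24MHz run new.sr', 'the 16MHz run old.sr', 'the *MHz run *.sr')]

def matches_both(pattern, f1, f2):
    """Hypothetical check: does the wildcard match both file names?"""
    return fnmatchcase(f1, pattern) and fnmatchcase(f2, pattern)

# Each expected pattern should match the two names it was derived from.
results = [matches_both(pattern, f1, f2) for f1, f2, pattern in names]
print(all(results))  # True
```

Note that a literal [ ... ] in a file name can be read as a character class by fnmatch and glob, which this check does not account for.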

stovfl

Here's what I use. It handles more than two files, and handles path separators appropriately, producing '**' where a recursive glob would be necessary:

import os
import re
import difflib

def bolg(filepaths, minOrphanCharacters=2):
    """
    Approximate inverse of `glob.glob`: take a sequence of `filepaths`
    and compute a glob pattern that matches them all. Only the star
    character will be used (no question marks or square brackets).

    Define an "orphan" substring as a sequence of characters, not
    including a file separator, that is sandwiched between two stars.   
    Orphan substrings shorter than `minOrphanCharacters` will be
    reduced to a star. If you don't mind having short orphan
    substrings in your result, set `minOrphanCharacters=1` or 0.
    Then you might get ugly results like '*0*2*.txt' (which contains
    two orphan substrings, both of length 1).
    """
    if os.path.sep == '\\':
        # On Windows, convert to forward-slashes (Python can handle
        # it, and Windows doesn't permit them in filenames anyway):
        filepaths = [filepath.replace('\\', '/') for filepath in filepaths]
    out = ''
    for filepath in filepaths:
        if not out: out = filepath; continue
        # Replace differing characters with stars:
        out = ''.join(x[-1] if x[0] == ' ' or x[-1] == '/' else '*' for x in difflib.ndiff(out, filepath))
        # Collapse multiple consecutive stars into one:
        out = re.sub(r'\*+', '*', out)
    # Deal with short orphan substrings:
    if minOrphanCharacters > 1:
        pattern = r'\*+[^/]{0,%d}\*+' % (minOrphanCharacters - 1)
        while True:
            reduced = re.sub(pattern, '*', out)
            if reduced == out: break
            out = reduced
    # Collapse any intermediate-directory globbing into a double-star:
    out = re.sub(r'(^|/).*\*.*/', r'\1**/', out)
    return out
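
To make the core step easier to follow, here is a minimal sketch of just the character-diff stage, pulled out into a hypothetical helper `star_diff` that is not part of the code above. It assumes two plain file names with no directory separators, so the '/' handling, the orphan clean-up and the double-star step are all left out:

```python
import difflib
import re

def star_diff(a, b):
    """Hypothetical helper: star out the characters where a and b differ."""
    # Given two strings, difflib.ndiff compares them character by character
    # and yields entries such as '  x' (common), '- x' (only in a) and
    # '+ x' (only in b). Keep the common characters, star the rest.
    starred = ''.join(entry[-1] if entry[0] == ' ' else '*'
                      for entry in difflib.ndiff(a, b))
    # Collapse each run of stars into a single star.
    return re.sub(r'\*+', '*', starred)

print(star_diff('foo.txt', 'bar.txt'))                      # *.txt
print(star_diff('some foo file.txt', 'some bar file.txt'))  # some * file.txt
```

Because ndiff aligns on the longest common runs instead of comparing position by position, names of different lengths need no padding.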
jez
  • Love the name (bolg is glob backward). Modifying the filepaths input in-place is a little creepy. As is the infinite loop potential (or difficulty to rule it out). Less so the flagrant PEP8 violations. Good 'splainin in the docstring. Examples would be nice, e.g. `assert '*.txt' == bolg(('foo.txt','bar.txt'))` – Bob Stein Nov 11 '22 at 21:59
  • @BobStein There’s no in-place modification going on. `filepaths` is just iterated-through once, to read the strings. (It may or may not have been *replaced* locally by a new list before that, but that doesn’t modify the original sequence, which could just as well be a tuple or generator.) – jez Nov 11 '22 at 23:57
  • Oh of course, I see it now. You totally rebuild that list. – Bob Stein Nov 12 '22 at 00:24
  • As for the `while True` I guess it may look alarming, but it's not so difficult to prove convergence: the `sub()` operation reduces a substring containing stars to a star, so there must be either no change (instant exit) or a shortening of the string (which can only happen a finite number of times). – jez Nov 12 '22 at 00:39
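
The convergence argument in the last comment can be checked with a small sketch. The helper `collapse_orphans` is hypothetical; it repeats only the orphan loop from the answer and records every intermediate string, using the '*0*2*.txt' example from the docstring:

```python
import re

def collapse_orphans(out, min_orphan_characters=2):
    """Hypothetical helper: run the orphan loop and return each step."""
    pattern = r'\*+[^/]{0,%d}\*+' % (min_orphan_characters - 1)
    steps = [out]
    while True:
        reduced = re.sub(pattern, '*', out)
        if reduced == out:
            break
        # Every match is at least two characters long and is replaced by
        # one, so a changed string is always strictly shorter.
        assert len(reduced) < len(out)
        out = reduced
        steps.append(out)
    return steps

print(collapse_orphans('*0*2*.txt'))  # ['*0*2*.txt', '*2*.txt', '*.txt']
```

Since the length drops on every pass that changes anything, the loop runs at most len(out) times.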