Marking dynamic substrings in a list of strings

Question

Assume these two sets of strings:

file=sheet-2016-12-08.xlsx
file=sheet-2016-11-21.xlsx
file=sheet-2016-11-12.xlsx
file=sheet-2016-11-08.xlsx
file=sheet-2016-10-22.xlsx
file=sheet-2016-09-29.xlsx
file=sheet-2016-09-05.xlsx
file=sheet-2016-09-04.xlsx

size=1024KB
size=22KB
size=980KB
size=15019KB
size=202KB

I need to run a function on both these sets separately and receive the following output respectively:

file=sheet-2016-*.xlsx

size=*KB

The dataset can be any set of strings. It doesn't have to match the format. Here's another example for instance:

id.4030.paid
id.1280.paid
id.88.paid

For which the expected output would be:

id.*.paid

Basically, I need a function to analyze a set of strings and replace the uncommon substring with an asterisk(*)

Jean-François Fabre · Accepted Answer · 2017-08-26T16:54:50.477

you could use os.path.commonprefix to compute the common prefix. It is used to compute shared directories in a list of filepaths, but it can be used in a generic context.

Then reverse the strings, and apply common prefix again, then reverse, to compute common suffix (adapted from https://gist.github.com/willwest/ca5d050fdf15232a9e67)

dataset = """id.4030.paid
id.1280.paid
id.88.paid""".splitlines()

import os


# Return the longest common suffix in a list of strings
def longest_common_suffix(list_of_strings):
    reversed_strings = [s[::-1] for s in list_of_strings]
    return os.path.commonprefix(reversed_strings)[::-1]

common_prefix = os.path.commonprefix(dataset)
common_suffix = longest_common_suffix(dataset)

print("{}*{}".format(common_prefix,common_suffix))

result:

id.*.paid

EDIT: as wim noted:

when all strings are equal, common prefixes & suffixes are the same, but it should return the string itself instead of prefix*suffix: should check if all strings are the same
when common prefix & suffixes overlap/have shared letters, this confuses the computation as well: should compute common suffix on the string minus the common prefix

So a all-in-one method is required to test the list beforehand to make sure that at least 2 strings are different (condensing the prefix/suffix formula in the process), and compute common suffix with slicing to remove common prefix:

def compute_generic_string(dataset):
    # edge case where all strings are the same
    if len(set(dataset))==1:
        return dataset[0]

    commonprefix = os.path.commonprefix(dataset)

    return "{}*{}".format(commonprefix,os.path.commonprefix([s[len(commonprefix):][::-1] for s in dataset])[::-1])

now let's test this:

for dataset in [['id.4030.paid','id.1280.paid','id.88.paid'],['aBBc', 'aBc'],[]]:
    print(compute_generic_string(dataset))

result:

id.*.paid
aB*c
*

(when dataset is empty, code returns *, maybe that should be another edge case)

Dang, `os.path.commonprefix`! How long has that been around. — wim, Aug 25 '17 at 22:43
os.path.commonprefix is not very difficult to implement, but a bit tedious. So it's great that it's built-in. — Jean-François Fabre, Aug 26 '17 at 07:42
This still has bugs. For example, input `['aBBc', 'aBc']` returns `'aB*Bc'` due to overlap of the common prefix and suffix. — wim, Aug 26 '17 at 16:43
@wim: right! tough to spot and tough to fix. reverse wildcard isn't that easy. I take on the challenge. — Jean-François Fabre, Aug 26 '17 at 16:49

wim · Answer 2 · 2017-08-26T16:45:32.393

1

from os.path import commonprefix

def commonsuffix(m):
    return commonprefix([s[::-1] for s in m])[::-1]

def inverse_glob(strs):
    start = commonprefix(strs)
    n = len(start)
    ends = [s[n:] for s in strs]
    end = commonsuffix(ends)
    if start and not any(ends):
        return start
    else:
        return start + '*' + end

This problem is trickier than it seems at face value.

As currently specified, the question remains poorly constrained, i.e. there is not a unique solution. For input ['spamAndEggs', 'spamAndHamAndEggs'], both spam*AndEggs and spamAnd*Eggs are valid answers. For input ['aXXXXz', 'aXXXz'] there are four possible solutions. In the code given above, we prefer choosing the longest possible prefix in order to make the solution unique.

Credit to JFF's answer for pointing out the existence of os.path.commonprefix.

Inverse glob - reverse engineer a wildcard string from file names is a related and more difficult generalisation of this question.

edited Aug 26 '17 at 16:45

answered Aug 25 '17 at 22:41

wim

338,267
99
616
750

thanks for your comments & help to make my solution better. Some may object that your solution is a copy of mine, but without your comments, I couldn't have achieved a working one. – Jean-François Fabre Aug 26 '17 at 21:00
FWIW my [original solution](https://stackoverflow.com/revisions/45890262/1) had same bug as yours. Deleted it when I saw your implementation, that was better. – wim Aug 26 '17 at 21:09
we can say we beat that one together :) – Jean-François Fabre Aug 26 '17 at 21:09

Marking dynamic substrings in a list of strings

2 Answers2