You could use os.path.commonprefix to compute the common prefix. It is meant for finding shared directories in a list of file paths, but it compares strings character by character, so it works on any strings.
To get the common suffix, reverse the strings, apply commonprefix again, then reverse the result (adapted from https://gist.github.com/willwest/ca5d050fdf15232a9e67)
dataset = """id.4030.paid
id.1280.paid
id.88.paid""".splitlines()
import os
# Return the longest common suffix in a list of strings
def longest_common_suffix(list_of_strings):
    reversed_strings = [s[::-1] for s in list_of_strings]
    return os.path.commonprefix(reversed_strings)[::-1]
common_prefix = os.path.commonprefix(dataset)
common_suffix = longest_common_suffix(dataset)
print("{}*{}".format(common_prefix,common_suffix))
result:
id.*.paid
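As a side note, os.path.commonprefix compares character by character rather than path component by path component, which is why it works on arbitrary strings. A small sketch of that behaviour (the sample paths and words are made up for illustration):

```python
import os

# Hypothetical sample paths, chosen only for illustration
paths = ["/usr/lib", "/usr/local"]

# commonprefix works on characters, so the result need not be a valid directory
char_prefix = os.path.commonprefix(paths)
print(char_prefix)  # /usr/l

# The same call on plain strings that are not paths at all
words = ["interstellar", "internet", "interval"]
word_prefix = os.path.commonprefix(words)
print(word_prefix)  # inter
```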
EDIT: as wim noted:
- when all strings are equal, the common prefix and suffix are the same, but the code should return the string itself instead of prefix*suffix: it should check whether all strings are the same
- when the common prefix and suffix overlap (share letters), this confuses the computation as well: the common suffix should be computed on the strings minus the common prefix
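Both problems can be reproduced with a short sketch of the independent prefix/suffix approach. The helper name naive_pattern is hypothetical and only wraps the two calls used earlier:

```python
import os

def naive_pattern(strings):
    # Prefix and suffix are computed independently on the full strings
    prefix = os.path.commonprefix(strings)
    suffix = os.path.commonprefix([s[::-1] for s in strings])[::-1]
    return "{}*{}".format(prefix, suffix)

# All strings equal: prefix and suffix are both the whole string
print(naive_pattern(["abc", "abc"]))   # abc*abc

# Overlap: the single "B" of "aBc" is counted in both prefix and suffix
print(naive_pattern(["aBBc", "aBc"]))  # aB*Bc
```

In the second case the pattern aB*Bc needs at least four characters, so it no longer matches aBc.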
So an all-in-one function is needed: it first checks that at least 2 strings are different (condensing the prefix/suffix formula in the process), then computes the common suffix on slices with the common prefix removed:
def compute_generic_string(dataset):
    # edge case where all strings are the same
    if len(set(dataset)) == 1:
        return dataset[0]
    commonprefix = os.path.commonprefix(dataset)
    return "{}*{}".format(commonprefix, os.path.commonprefix([s[len(commonprefix):][::-1] for s in dataset])[::-1])
Now let's test this:
for dataset in [['id.4030.paid','id.1280.paid','id.88.paid'],['aBBc', 'aBc'],[]]:
    print(compute_generic_string(dataset))
result:
id.*.paid
aB*c
*
(When dataset is empty, the code returns *; maybe that should be handled as another edge case.)
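One possible way to cover the empty case is sketched below. The name compute_generic_string_safe and the choice to return an empty string for an empty list are assumptions, not part of the original function; raising an exception might suit some callers better.

```python
import os

def compute_generic_string_safe(dataset):
    # Assumption: an empty input maps to an empty string (the original returns "*")
    if not dataset:
        return ""
    # Edge case where all strings are the same
    if len(set(dataset)) == 1:
        return dataset[0]
    prefix = os.path.commonprefix(dataset)
    # Strip the common prefix before looking for the suffix, to avoid overlap
    remainders = [s[len(prefix):] for s in dataset]
    suffix = os.path.commonprefix([r[::-1] for r in remainders])[::-1]
    return "{}*{}".format(prefix, suffix)

print(repr(compute_generic_string_safe([])))  # ''
print(compute_generic_string_safe(["id.4030.paid", "id.1280.paid", "id.88.paid"]))  # id.*.paid
print(compute_generic_string_safe(["aBBc", "aBc"]))  # aB*c
```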