Pythonic way to "merge" strings, handling all possible lengths

Question

I'm trying to solve a fairly common problem in bioinformatics without resorting to a bunch of if statements.

The problem at hand:

I'm given two overlapping strings and a length of the expected output, and I want to produce a merged string. Here all the ways the strings might overlap: (in the following examples a - denotes that there is nothing in that string at that position. the consensus() bit is explained after the examples.):

# size=13
xxxOVERLAP---
---OVERLAPyyy
# expected output: xxx + consensus(xOVERLAP, yOVERLAP) + yyy


# size=7
---OVERLAPxxx
yyyOVERLAP---
# expected output: consensus(xOVERLAP, yOVERLAP)


# size=7
OVERLAP
OVERLAP
# expected output: consensus(xOVERLAP, yOVERLAP)

# size=10
xxxOVERLAP
---OVERLAP
# expected output: xxx + consensus(xOVERLAP, yOVERLAP)

# size=10
OVERLAP---
OVERLAPyyy
# expected output: consensus(xOVERLAP, yOVERLAP) + yyy

# size > len(x) + len(y)
# no overlap, produce error:
xxx---
---yyy
# expected output: error

The resulting merged string needs to start with the beginning of x and end with the end of y. The region that overlaps need to be passed to another function, consensus() that deals with merging the overlapped region. Here all the ways the strings might overlap: (in the following examples a - denotes that there is nothing in that string at that position)

def merge(x, y, size):
    # do the mergeing
    return part of x that doesn't overlap + consensus(overlap) + part of y that doesn't overlap.

I can code up a mess of if statements to recognize each case and deal with it individually, but I've been struggling to find a more elegant solution. One approach I considered is padding the strings (the end of x and the beginning of y) so that all cases look like the second example, but this seems too inefficient to be palatable, since i'd be making new strings when i did that and I'm applying this function to millions of strings.

I don't follow the description. In your examples, you provide three inputs, but did not provide the output for each example. — user590028, Jul 01 '16 at 17:32
You can start from here https://en.wikipedia.org/wiki/Longest_common_substring_problem — BlackBear, Jul 01 '16 at 17:39
in this case, I know the size of the string I want upfront, so the overlap is a deterministic function of the size of `x`, `y`, and the given `size`. Also, the `OVERLAP` region doesn't necessarily match 100%, which is why i need to pass that region of both `x` and `y` to the consensus function. — elsherbini, Jul 01 '16 at 17:42
How do we decide which way to line up the strings? (Are you using Biopython? This kind of job sounds like something you should be relying on a bioinformatics library for.) — user2357112, Jul 01 '16 at 17:49
Yes I'm using biopython. The resulting string should always start with the beginning of `x` and end with the end of `y`. — elsherbini, Jul 01 '16 at 17:52
Also I presume you don't know what the `OVERLAP` part is ahead of time? — DaveBensonPhillips, Jul 01 '16 at 18:00
The overlap region is totally determined by the input, you don't have to search for it. If `len(x) = 25`, `len(y) = 25`, and `size=20`, you know that the xoverlap region is `x[:20]` and yoverlap is `y[5:]`. — elsherbini, Jul 01 '16 at 18:02
Because the result always starts with the beginning of x and ends with the end of y. — elsherbini, Jul 01 '16 at 18:31
You might want to consider looking at the `difflib` library and `SequenceMatcher`... — Jon Clements, Jul 04 '16 at 05:58

score 0 · Answer 1 · edited May 23 '17 at 12:15

0

I would start with a generator that yields each character:

def merge_gen(x, y, overhang):
    buffer = ' ' * overhang
    for s in map(set, zip(buffer + x, y + buffer)):
        yield max(s)

Where overhang is len(x) - size (see below)

That works as follows:

>>> list(merge_gen('OVERLAPXXX', 'YYYOVERLAP', 3))
['Y', 'Y', 'Y', 'O', 'V', 'E', 'R', 'L', 'A', 'P', 'X', 'X', 'X']

You can then implement a merge function including the consensus function as follows:

def merge(x, y, size):
    length = len(x)
    overhang = size - length
    overlap = length - overhang
    gen = merge_gen(x, y, overhang)

    result = ''
    result += ''.join(next(gen) for _ in range(overhang))
    result += consensus(''.join(next(gen) for _ in range(overlap)))
    result += ''.join(next(gen) for _ in range(overhang))
    return result

I would hope that this is reasonably efficient in Python3; lots of generators, few wasted strings to be discarded, etc.

(*) Apparently this is a fast way of getting a single item from a set. In this case we know the set only has one element, and we just want to extract.

edited May 23 '17 at 12:15

Community

1
1

answered Jul 01 '16 at 18:34

DaveBensonPhillips

3,134
1
20
32

This seems cool, but at the moment: `x= "OVERLAPxxx", y=yyyOVERLAP; merge(x,y,7) => 'yyyOVEROAPRLA'` – elsherbini Jul 01 '16 at 19:05
@elsherbini Fixed, I had a typo :) – DaveBensonPhillips Jul 01 '16 at 19:40
Hmm, expected output would just be `"OVERLAP"` for that input, but instead you get `"yyyOVERLAPxxx"` – elsherbini Jul 01 '16 at 19:45
@elsherbini I was basing this on your first example but I'd not realised you've specified `size` as 11 in that rather than 7. I thought `size` was supposed to be the length of the overlapping section - what is it? – DaveBensonPhillips Jul 01 '16 at 19:59
size is the length of the merged string. – elsherbini Jul 01 '16 at 20:00
@elsherbini So how is that 11 in the first example? – DaveBensonPhillips Jul 01 '16 at 20:01

Jeff · Answer 2 · 2016-07-01T20:54:55.020

Is this the functionality you were looking for?

def consensus(left, right, ignore_blank_padding=True):
    if ignore_blank_padding:
        left = left.strip()
        right = right.strip()

    slide = len(left) + len(right) - 1

    #slides the strings over each other one spot at a time
    solutions = []
    for i in range(slide):
        lft_test = left[-(i+1):]
        rgt_test = right[:min(len(right), i+1)]
        #print(lft_test, rgt_test)
        if lft_test == rgt_test:
            lft_garbage = left[:-(i+1)]
            rgt_garbage = right[min(len(right), (i+1)):]
            solutions.append((lft_garbage, lft_test, rgt_garbage))

    #if more than one overlap combo is found, keeps only the longest
    if len(solutions) > 1:
        sol_lenghts = [len(i[1]) for i in solutions]                
        longest_index = sol_lenghts.index(max(an_lens))
        solutions = solutions[longest_index]
        return solutions
    elif len(solutions) == 0:
        return None
    else:
        return solutions[0]

left = 'xxxxHEY'
right = 'HEYxx'
consensus(left, right)
> ('xxxx', 'HEY', 'xx')

left = 'xxHEYHEY'
right = 'HEYHEYxxx'
consensus(left, right)
> ('xx', 'HEYHEY', 'xxx')

left = 'xxHEY '
right = '  HEYHEYxxxx'
consensus(left, right)
> ('xx', 'HEY', 'HEYxxxx')

left = 'HEY'
right = '  HEYHEYxxxx'
consensus(left, right)
> ('', 'HEY', 'HEYxxxx')

Left the old answer with the sliding window, but here it is with a specified overlap:

def consensus(left, right, size, ignore_blank_padding=True):
    if ignore_blank_padding:
        left = left.strip()
        right = right.strip()

    solutions = None
    lft_test = left[-(size):]
    rgt_test = right[:size]
    if lft_test == rgt_test:
        lft_garbage = left[:-(size)]
        rgt_garbage = right[min(len(right), (size)):]
        solutions = (lft_garbage, lft_test, rgt_garbage)

    return solutions

left = 'xxxxHEY'
right = 'HEYxx'
consensus(left, right, 3)
> ('xxxx', 'HEY', 'xx')

left = 'xxHEYHEY'
right = 'HEYHEYxxx'
consensus(left, right, 6)
> ('xx', 'HEYHEY', 'xxx')

left = 'xxHEY '
right = '  HEYHEYxxxx'
consensus(left, right, 3)
> ('xx', 'HEY', 'HEYxxxx')

left = 'HEY'
right = '  HEYHEYxxxx'
consensus(left, right, 3)
> ('', 'HEY', 'HEYxxxx')

This looks good if you don't know where the strings overlap. In my case I know that the string starts at the beginning of x, ends at the end of y, and has a certain size, so I know which parts are `lft_test`, `lft_garbage` and so on. — elsherbini, Jul 01 '16 at 19:14
Hmm, I see that. Definitely shouldn't discard information in the process - I think we could just pass a `size` argument into this and modify where the slicing starts. — Jeff, Jul 01 '16 at 19:15
There, added a version for a specified window. They could easily be combined into one function with an optional `size` argument that defaults to `None`. — Jeff, Jul 01 '16 at 21:11

score 0 · Answer 3 · answered Jul 01 '16 at 21:12

Here is a working example, but uses the "way too many if statements" approach which is hard to read, hard to reason about, and extremely inelegant:

def extra_left(x, y, size):
    if size - len(y) > 0:
        return x[:size - len(y)]
    else:
        return ""


def extra_right(x, y, size):
    if size - len(x) > 0:
        return y[len(x) - size:]
    else:
        return ""

def overlap(x, y, size):

    if len(x) < size and len(y) < size:
        x_overlap = x[size - len(y):]
        y_overlap = y[:len(x) - size]
    if len(x) < size and len(y) == size:
        x_overlap = x
        y_overlap = y[:len(x) - size]
    if len(x) < size and len(y) > size:
        x_overlap = x
        y_overlap = y[len(y)-size:size]

    if len(x) == size and len(y) < size:
        x_overlap = x[size - len(y):]
        y_overlap = y
    if len(x) == size and len(y) == size:
        x_overlap = x
        y_overlap = y
    if len(x) == size and len(y) > size:
        x_overlap = x
        y_overlap = y[len(y) - size:]

    if len(x) > size and len(y) < size:
        x_overlap = x[size - len(y):size]
        y_overlap = y
    if len(x) > size and len(y) == size:
        x_overlap = x[:size]
        y_overlap = y
    if len(x) > size and len(y) > size:
        x_overlap = x[:size]
        y_overlap = y[-size:]

    if len(x) + len(y) < size:
        raise RuntimeError("x and y do not overlap with this size")

    return consensus(x_overlap, y_overlap)

def consensus(x, y):
    assert len(x) == len(y)
    return x


def merge(x, y, size):
    return extra_left(x, y, size) + overlap(x, y, size) + extra_right(x, y, size)

Here are some unit tests (using pytest)

class Tests:

    def test1(self):
        """
        len(x) < size and len(y) < size:
        xxxOVERLAP---
        ---OVERLAPyyy
        # expected output: xxx + consensus(xOVERLAP, yOVERLAP) + yyy
        """
        x = "AAAATTTTTTT"
        y = "TTTTTTTCCC"
        size = 14
        assert merge(x, y, size) == "AAAA" + consensus("TTTTTTT", "TTTTTTT") + "CCC"

    def test2(self):
        """
        if len(x) < size and len(y) == size:
        # size=10
        OVERLAP---
        OVERLAPyyy
        # expected output: consensus(xOVERLAP, yOVERLAP) + yyy
        """
        x = "TTTTTTT"
        y = "TTTTTTTCCC"
        size = 10
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT") + "CCC"

    def test3(self):
        """
        if len(x) < size and len(y) > size:
        ---OVERLAP---
        yyyOVERLAPyyy
        """
        x = "TTTTTTT"
        y = "CCCTTTTTTTCCC"
        size = 10
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT") + "CCC"

    def test4(self):
        """
        if len(x) == size and len(y) < size:
        # size=10 
        xxxOVERLAP
        ---OVERLAP
        # expected output: xxx + consensus(xOVERLAP, yOVERLAP)
        """
        x = "AAATTTTTTT"
        y = "TTTTTTT"
        size = 10
        assert merge(x, y, size) == "AAA" + consensus("TTTTTTT", "TTTTTTT")

    def test5(self):
        """
        if len(x) == size and len(y) == size:
        # size=7
        OVERLAP
        OVERLAP
        # expected output: consensus(xOVERLAP, yOVERLAP)
        """
        x = "TTTTTTT"
        y = "TTTTTTT"
        size = 7
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT")

    def test6(self):
        """
        if len(x) == size and len(y) > size:
        # size=10
        --xxxOVERLAP
        yyyyyOVERLAP
        # expected output: consensus(xOVERLAP, yOVERLAP)
        """
        x = "AAATTTTTTT"
        y = "CCCCCTTTTTTT"
        size = 10
        assert merge(x, y, size) == "AAA" + consensus("TTTTTTT", "TTTTTTT")

    def test7(self):
        """
        if len(x) > size and len(y) < size:
        xxxOVERLAPxxx
        ---OVERLAP---
        """
        x = "AAATTTTTTTAAA"
        y = "TTTTTTT"
        size = 10
        assert merge(x, y, size) == "AAA" + consensus("TTTTTTT", "TTTTTTT")

    def test8(self):
        """
        if len(x) > size and len(y) == size:
        ---OVERLAPxxx
        ---OVERLAP---
        """
        x = "TTTTTTTAAA"
        y = "TTTTTTT"
        size = 7
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT")

    def test9(self):
        """
        if len(x) > size and len(y) > size:
        ---OVERLAPxxx
        yyyOVERLAP---
        # expected output: consensus(xOVERLAP, yOVERLAP)
        """
        x = "TTTTTTTAAA"
        y = "CCCTTTTTTT"
        size = 7
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT")


    def test_error(self):
        """
        # no overlap, produce error:
        xxx---
        ---yyy
        # expected output: error
        """
        x = "AAA"
        y = "TTT"
        size = 7
        with pytest.raises(RuntimeError):
            merge(x, y, size)

And they all pass:

test_merge.py::Tests::test1 PASSED
test_merge.py::Tests::test2 PASSED
test_merge.py::Tests::test3 PASSED
test_merge.py::Tests::test4 PASSED
test_merge.py::Tests::test5 PASSED
test_merge.py::Tests::test6 PASSED
test_merge.py::Tests::test7 PASSED
test_merge.py::Tests::test8 PASSED
test_merge.py::Tests::test9 PASSED
test_merge.py::Tests::test_error PASSED

====================================================================== 10 passed in 0.02 seconds =======================================================================

Pythonic way to "merge" strings, handling all possible lengths

3 Answers3