0

I'm trying to solve a fairly common problem in bioinformatics without resorting to a bunch of if statements.

The problem at hand:

I'm given two overlapping strings and a length of the expected output, and I want to produce a merged string. Here all the ways the strings might overlap: (in the following examples a - denotes that there is nothing in that string at that position. the consensus() bit is explained after the examples.):

# size=13
xxxOVERLAP---
---OVERLAPyyy
# expected output: xxx + consensus(xOVERLAP, yOVERLAP) + yyy


# size=7
---OVERLAPxxx
yyyOVERLAP---
# expected output: consensus(xOVERLAP, yOVERLAP)


# size=7
OVERLAP
OVERLAP
# expected output: consensus(xOVERLAP, yOVERLAP)

# size=10
xxxOVERLAP
---OVERLAP
# expected output: xxx + consensus(xOVERLAP, yOVERLAP)

# size=10
OVERLAP---
OVERLAPyyy
# expected output: consensus(xOVERLAP, yOVERLAP) + yyy

# size > len(x) + len(y)
# no overlap, produce error:
xxx---
---yyy
# expected output: error

The resulting merged string needs to start with the beginning of x and end with the end of y. The region that overlaps need to be passed to another function, consensus() that deals with merging the overlapped region. Here all the ways the strings might overlap: (in the following examples a - denotes that there is nothing in that string at that position)

def merge(x, y, size):
    # do the mergeing
    return part of x that doesn't overlap + consensus(overlap) + part of y that doesn't overlap.

I can code up a mess of if statements to recognize each case and deal with it individually, but I've been struggling to find a more elegant solution. One approach I considered is padding the strings (the end of x and the beginning of y) so that all cases look like the second example, but this seems too inefficient to be palatable, since i'd be making new strings when i did that and I'm applying this function to millions of strings.

tripleee
  • 175,061
  • 34
  • 275
  • 318
elsherbini
  • 1,596
  • 13
  • 23

3 Answers3

0

I would start with a generator that yields each character:

def merge_gen(x, y, overhang):
    buffer = ' ' * overhang
    for s in map(set, zip(buffer + x, y + buffer)):
        yield max(s)

Where overhang is len(x) - size (see below)

That works as follows:

>>> list(merge_gen('OVERLAPXXX', 'YYYOVERLAP', 3))
['Y', 'Y', 'Y', 'O', 'V', 'E', 'R', 'L', 'A', 'P', 'X', 'X', 'X']

You can then implement a merge function including the consensus function as follows:

def merge(x, y, size):
    length = len(x)
    overhang = size - length
    overlap = length - overhang
    gen = merge_gen(x, y, overhang)

    result = ''
    result += ''.join(next(gen) for _ in range(overhang))
    result += consensus(''.join(next(gen) for _ in range(overlap)))
    result += ''.join(next(gen) for _ in range(overhang))
    return result

I would hope that this is reasonably efficient in Python3; lots of generators, few wasted strings to be discarded, etc.

(*) Apparently this is a fast way of getting a single item from a set. In this case we know the set only has one element, and we just want to extract.

Community
  • 1
  • 1
DaveBensonPhillips
  • 3,134
  • 1
  • 20
  • 32
0

Is this the functionality you were looking for?

def consensus(left, right, ignore_blank_padding=True):
    if ignore_blank_padding:
        left = left.strip()
        right = right.strip()

    slide = len(left) + len(right) - 1

    #slides the strings over each other one spot at a time
    solutions = []
    for i in range(slide):
        lft_test = left[-(i+1):]
        rgt_test = right[:min(len(right), i+1)]
        #print(lft_test, rgt_test)
        if lft_test == rgt_test:
            lft_garbage = left[:-(i+1)]
            rgt_garbage = right[min(len(right), (i+1)):]
            solutions.append((lft_garbage, lft_test, rgt_garbage))

    #if more than one overlap combo is found, keeps only the longest
    if len(solutions) > 1:
        sol_lenghts = [len(i[1]) for i in solutions]                
        longest_index = sol_lenghts.index(max(an_lens))
        solutions = solutions[longest_index]
        return solutions
    elif len(solutions) == 0:
        return None
    else:
        return solutions[0]

left = 'xxxxHEY'
right = 'HEYxx'
consensus(left, right)
> ('xxxx', 'HEY', 'xx')

left = 'xxHEYHEY'
right = 'HEYHEYxxx'
consensus(left, right)
> ('xx', 'HEYHEY', 'xxx')

left = 'xxHEY '
right = '  HEYHEYxxxx'
consensus(left, right)
> ('xx', 'HEY', 'HEYxxxx')

left = 'HEY'
right = '  HEYHEYxxxx'
consensus(left, right)
> ('', 'HEY', 'HEYxxxx')

Left the old answer with the sliding window, but here it is with a specified overlap:

def consensus(left, right, size, ignore_blank_padding=True):
    if ignore_blank_padding:
        left = left.strip()
        right = right.strip()

    solutions = None
    lft_test = left[-(size):]
    rgt_test = right[:size]
    if lft_test == rgt_test:
        lft_garbage = left[:-(size)]
        rgt_garbage = right[min(len(right), (size)):]
        solutions = (lft_garbage, lft_test, rgt_garbage)

    return solutions

left = 'xxxxHEY'
right = 'HEYxx'
consensus(left, right, 3)
> ('xxxx', 'HEY', 'xx')

left = 'xxHEYHEY'
right = 'HEYHEYxxx'
consensus(left, right, 6)
> ('xx', 'HEYHEY', 'xxx')

left = 'xxHEY '
right = '  HEYHEYxxxx'
consensus(left, right, 3)
> ('xx', 'HEY', 'HEYxxxx')

left = 'HEY'
right = '  HEYHEYxxxx'
consensus(left, right, 3)
> ('', 'HEY', 'HEYxxxx')
Jeff
  • 2,158
  • 1
  • 16
  • 29
  • This looks good if you don't know where the strings overlap. In my case I know that the string starts at the beginning of x, ends at the end of y, and has a certain size, so I know which parts are `lft_test`, `lft_garbage` and so on. – elsherbini Jul 01 '16 at 19:14
  • Hmm, I see that. Definitely shouldn't discard information in the process - I think we could just pass a `size` argument into this and modify where the slicing starts. – Jeff Jul 01 '16 at 19:15
  • There, added a version for a specified window. They could easily be combined into one function with an optional `size` argument that defaults to `None`. – Jeff Jul 01 '16 at 21:11
0

Here is a working example, but uses the "way too many if statements" approach which is hard to read, hard to reason about, and extremely inelegant:

def extra_left(x, y, size):
    if size - len(y) > 0:
        return x[:size - len(y)]
    else:
        return ""


def extra_right(x, y, size):
    if size - len(x) > 0:
        return y[len(x) - size:]
    else:
        return ""

def overlap(x, y, size):

    if len(x) < size and len(y) < size:
        x_overlap = x[size - len(y):]
        y_overlap = y[:len(x) - size]
    if len(x) < size and len(y) == size:
        x_overlap = x
        y_overlap = y[:len(x) - size]
    if len(x) < size and len(y) > size:
        x_overlap = x
        y_overlap = y[len(y)-size:size]

    if len(x) == size and len(y) < size:
        x_overlap = x[size - len(y):]
        y_overlap = y
    if len(x) == size and len(y) == size:
        x_overlap = x
        y_overlap = y
    if len(x) == size and len(y) > size:
        x_overlap = x
        y_overlap = y[len(y) - size:]

    if len(x) > size and len(y) < size:
        x_overlap = x[size - len(y):size]
        y_overlap = y
    if len(x) > size and len(y) == size:
        x_overlap = x[:size]
        y_overlap = y
    if len(x) > size and len(y) > size:
        x_overlap = x[:size]
        y_overlap = y[-size:]

    if len(x) + len(y) < size:
        raise RuntimeError("x and y do not overlap with this size")

    return consensus(x_overlap, y_overlap)

def consensus(x, y):
    assert len(x) == len(y)
    return x


def merge(x, y, size):
    return extra_left(x, y, size) + overlap(x, y, size) + extra_right(x, y, size)

Here are some unit tests (using pytest)

class Tests:

    def test1(self):
        """
        len(x) < size and len(y) < size:
        xxxOVERLAP---
        ---OVERLAPyyy
        # expected output: xxx + consensus(xOVERLAP, yOVERLAP) + yyy
        """
        x = "AAAATTTTTTT"
        y = "TTTTTTTCCC"
        size = 14
        assert merge(x, y, size) == "AAAA" + consensus("TTTTTTT", "TTTTTTT") + "CCC"

    def test2(self):
        """
        if len(x) < size and len(y) == size:
        # size=10
        OVERLAP---
        OVERLAPyyy
        # expected output: consensus(xOVERLAP, yOVERLAP) + yyy
        """
        x = "TTTTTTT"
        y = "TTTTTTTCCC"
        size = 10
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT") + "CCC"

    def test3(self):
        """
        if len(x) < size and len(y) > size:
        ---OVERLAP---
        yyyOVERLAPyyy
        """
        x = "TTTTTTT"
        y = "CCCTTTTTTTCCC"
        size = 10
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT") + "CCC"

    def test4(self):
        """
        if len(x) == size and len(y) < size:
        # size=10 
        xxxOVERLAP
        ---OVERLAP
        # expected output: xxx + consensus(xOVERLAP, yOVERLAP)
        """
        x = "AAATTTTTTT"
        y = "TTTTTTT"
        size = 10
        assert merge(x, y, size) == "AAA" + consensus("TTTTTTT", "TTTTTTT")

    def test5(self):
        """
        if len(x) == size and len(y) == size:
        # size=7
        OVERLAP
        OVERLAP
        # expected output: consensus(xOVERLAP, yOVERLAP)
        """
        x = "TTTTTTT"
        y = "TTTTTTT"
        size = 7
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT")

    def test6(self):
        """
        if len(x) == size and len(y) > size:
        # size=10
        --xxxOVERLAP
        yyyyyOVERLAP
        # expected output: consensus(xOVERLAP, yOVERLAP)
        """
        x = "AAATTTTTTT"
        y = "CCCCCTTTTTTT"
        size = 10
        assert merge(x, y, size) == "AAA" + consensus("TTTTTTT", "TTTTTTT")

    def test7(self):
        """
        if len(x) > size and len(y) < size:
        xxxOVERLAPxxx
        ---OVERLAP---
        """
        x = "AAATTTTTTTAAA"
        y = "TTTTTTT"
        size = 10
        assert merge(x, y, size) == "AAA" + consensus("TTTTTTT", "TTTTTTT")

    def test8(self):
        """
        if len(x) > size and len(y) == size:
        ---OVERLAPxxx
        ---OVERLAP---
        """
        x = "TTTTTTTAAA"
        y = "TTTTTTT"
        size = 7
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT")

    def test9(self):
        """
        if len(x) > size and len(y) > size:
        ---OVERLAPxxx
        yyyOVERLAP---
        # expected output: consensus(xOVERLAP, yOVERLAP)
        """
        x = "TTTTTTTAAA"
        y = "CCCTTTTTTT"
        size = 7
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT")


    def test_error(self):
        """
        # no overlap, produce error:
        xxx---
        ---yyy
        # expected output: error
        """
        x = "AAA"
        y = "TTT"
        size = 7
        with pytest.raises(RuntimeError):
            merge(x, y, size)

And they all pass:

test_merge.py::Tests::test1 PASSED
test_merge.py::Tests::test2 PASSED
test_merge.py::Tests::test3 PASSED
test_merge.py::Tests::test4 PASSED
test_merge.py::Tests::test5 PASSED
test_merge.py::Tests::test6 PASSED
test_merge.py::Tests::test7 PASSED
test_merge.py::Tests::test8 PASSED
test_merge.py::Tests::test9 PASSED
test_merge.py::Tests::test_error PASSED

====================================================================== 10 passed in 0.02 seconds =======================================================================
elsherbini
  • 1,596
  • 13
  • 23