20

I just started learning python and here I have a sorted list of protein sequences (total 59,000 sequences) and some of them overlap. I have made a toy list here for example:

ABCDE
ABCDEFG
ABCDEFGH
ABCDEFGHIJKLMNO
CEST
DBTSFDE
DBTSFDEO
EOEUDNBNUW
EOEUDNBNUWD
EAEUDNBNUW
FEOEUDNBNUW
FG
FGH

I would like to remove those shorter overlap and just keep the longest one so the desired output would look like this:

ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EAEUDNBNUW
FEOEUDNBNUWD
FGH

How can I do it? My code looks like this:

with open('toy.txt' ,'r') as f:
    pattern = f.read().splitlines()
    print pattern

    for i in range(0, len(pattern)):
        if pattern[i] in pattern[i+1]:
            pattern.remove(pattern[i])
        print pattern

And I got the error message:

['ABCDE', 'ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    if pattern[i] in pattern[i+1]:
IndexError: list index out of range
scharette
  • 9,437
  • 8
  • 33
  • 67
Kenny
  • 221
  • 2
  • 5
  • what troubles do you have? – Brown Bear Jul 13 '18 at 14:49
  • 3
    Do the proteins always start the same or they may overlap elsewhere? Is your file sorted? These assumptions could lead to a more efficient solution if they are true – Chris_Rands Jul 13 '18 at 14:51
  • @Chris_Rands : I have sorted the list before, so i was thinking to compare item 1 and item 2, and then item 2 and item 3 and so on... – Kenny Jul 13 '18 at 14:54
  • By 'overlap', do you mean 'contains' or 'starts with'? – DSM Jul 13 '18 at 15:11
  • @DSM : I mean contains. sometimes sequences are "ABCD", "ABCDE", but could also be "EABCD" – Kenny Jul 13 '18 at 15:23
  • 5
    @Kenny: okay, that's trickier, because then your sort won't put ABCD and EABCD next to each other, so I think some of the answers won't help much. – DSM Jul 13 '18 at 15:26
  • 2
    @DSM I agree with you. Considering OP did not mention that though I think the question (and answers) stays valid and really interesting. Unfortunately it won't apply to his specific problem, but will most likely apply to futur users. – scharette Jul 13 '18 at 15:33
  • @scharette : yeah it's my bad, I didn't realize some sequences have extra letters before the sequence – Kenny Jul 13 '18 at 15:35
  • I recommend CD-HIT with 100% id threshold for this task. This C++ implementation will likely be faster than any Python implementation https://github.com/weizhongli/cdhit – Chris_Rands Jul 13 '18 at 15:38
  • How long are your sequences? (Obviously short enough that you can read 59k of them into memory at once.) – DSM Jul 13 '18 at 15:39
  • @DSM : They vary in length. My shortest sequence is 3 characters long, and longest being 142 characters long – Kenny Jul 13 '18 at 15:51
  • 1
    @Kenny Now that I think about it. i'm not sure if you understand what we mean. In the case you would have "ABCD" and "DCBA", which one would you keep ? because you want the shorter version but these two are the same length ? So, would you keep both ? – scharette Jul 13 '18 at 15:51
  • @scharette : In my sequences, I have seen ABCD, ABCDR, ABCDRS, UEABCD, BUEABCD etc. (sorry I am not allowed to expose the actual sequences due to confidentiality) but the idea is from the above 5 examples, I can say that the consensus sequence is BUEABCDRS because all the examples match 100% to the consensus. – Kenny Jul 13 '18 at 15:58
  • @Kenny Ok, you'll need a more complex algorithm for your specific case. But, you're still not answering what to do in the case where you would have "ABC" and "CBA". Which one is favored ? Also, the example you used to clarify the problem is specifically misleading since it is still sorted and could possibly be neighbors in a sorted list. – scharette Jul 13 '18 at 16:02
  • @Kenny I edited your question to its original state (the one that got upvoted). In the future please be more specific when asking questions. If you need an answer to your specific problem, feel free to open a new thread but this time being clear. – scharette Jul 13 '18 at 16:09
  • @scharette : thank you. I am new to SO as well, didn't know the rules :) To answer your question, yeah this is a very complicated task. There is very slim chance to see ABC and CBA. if that's the case, I'd prefer to keep both patterns. because for my next step, I need to match all these patterns to my protein sequences and to find the locations (this part I have the script ready) – Kenny Jul 13 '18 at 16:17
  • can you sort the sequences themselves, at least for overall list sorting purposes? i.e. `"ABCD", "ABCDE", "EABCD"` becomes `"ABCD", "ABCDE", "ABCDE"` ? – JL Peyret Jul 14 '18 at 00:06
  • @Kenny Its fine, we are all learning. Sorry I couldn't help you with your specific problem. I suggest you read my answer though it still underlines an important mistake you did in your original code and could be beneficial for you to understand what this is. – scharette Jul 14 '18 at 00:20
  • Notice, `FGH` overlaps with `ABCDEFGHIJKLMNO`. Are you sure your expected results are correct? – pylang Jul 14 '18 at 17:30
  • Also `FEOEUDNBNUWD` is not one of your original sequences. Clarify whether you want post-concatenation of overlapping sequences. – pylang Jul 14 '18 at 17:36

11 Answers11

15

There is other working answers, but none of them explain your actual problem. you were actually really close of a valid solution and what is, in my opinion, the most readable answer.

The error came from the fact that you were mutating the same list while checking for index using range().

Thus, while increasing the i variable you were removing item from the list which at one point causes the index error inevitably.

Therefore, here is a working version of your initial code with some changes,

pattern = ["ABCDE","ABCDEFG","ABCDEFGH","ABCDEFGHIJKLMNO","CEST","DBTSFDE","DBTSFDEO","EOEUDNBNUW","EAEUDNBNUW","FG","FGH"]
output_pattern = []


for i in range(0, (len(pattern)-1)):
    if not pattern[i] in pattern[i+1]:
        output_pattern.append(pattern[i]) 

# Adding the last item
output_pattern.append(pattern[-1])   
print (output_pattern)

>>>> ['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']    

Note that this code will work if your list is previously sorted as you mentioned in comment section.

What is this code doing ?

Basically, it use the same logic of your initial answer where it iterates on the list and check if the next item contains the current item. But, using another list and iterating until the before last item, will fix your index problem. But now comes a question,

What should I do with the last item ?

Since the list is sorted, you can consider the last item as always being unique. This is why I'm using

output_pattern.append(pattern[-1])

which adds the last item of the initial list.

Important note

This answer was written in response to OP's initial question where he wanted to keep the longer overlap and I quote based on the next item in same list. As stated by @Chris_Rands if your concerns are related to a biological task and need to find any overlap, this solution is not suited for your needs.

Example where this code would fail to recognize a potential overlap,

pattern = ["ACD", "AD", "BACD"]

where it would output the same result without removing the possible "ACD" overlap. Now, just as a clarification though, this would imply a much more complex algorithm and I initially thought it was out of the scope of the question's requirements. If ever this is your case, I may be completely wrong here, but I truly think a C++ implementation seems more appropriate. have a look at the CD-Hit algorithm suggested by @Chris_Rands in the comment section.

scharette
  • 9,437
  • 8
  • 33
  • 67
  • It's not fully appropriate, because e.g. 'abcd' in 'babcd' -> True but they are different. – Rob Jul 15 '18 at 11:36
  • @Rob No. This is what OP wanted. The question was to keep the biggest overlap if it _contains_ the next one. Therefore, considering `"ABCD"` and `"BABCD"` the code should keep the former. This is what it is doing. – scharette Jul 15 '18 at 11:41
  • You can't only consider `i+1`, for example this fails for `pattern = ['ACD', 'AD', 'BACD']` – Chris_Rands Jul 17 '18 at 09:26
  • @chris_rands Please read the question's comments section. I've discussed this issue with OP. His initial question was to base his search and I quote _on the next item of the list_. Therefore, this is what I did. He kept on changing the requirements, so I decided to keep the original state of the question which people up-voted. – scharette Jul 17 '18 at 10:01
  • @scharette yes i understand it's not your fault, the question was unclear initially, but if you understand the biological issue that this is meant to address then I can't imagine a use case for your solution, and it might introduce a bug – Chris_Rands Jul 17 '18 at 10:08
  • 2
    @Chris_Rands I understand your concern. I'll add a note for future users. – scharette Jul 17 '18 at 12:09
  • @Chris_Rands I have clarified my questions in a new post https://stackoverflow.com/questions/51406699/python-consolidate-similar-patterns-into-single-consensus-pattern – Kenny Jul 18 '18 at 16:41
  • @scharette Please check my new thread https://stackoverflow.com/questions/51406699/python-consolidate-similar-patterns-into-single-consensus-pattern – Kenny Jul 18 '18 at 16:41
5

You could use groupby() and max() to help here:

from itertools import groupby

with open('toy.txt') as f_input:
    for key, group in groupby(f_input, lambda x: x[:2]):
        print(max(group, key=lambda x: len(x)).strip())

This would display:

ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EOEUDNBNUW
EAEUDNBNUW
FGH

groupby() works by returning a list of matching items based on a function, in this case consecutive lines with the same first 2 characters. The max() function then takes this list and returns the list item with the longest length.

Martin Evans
  • 45,791
  • 17
  • 81
  • 97
  • 2
    They do not want to merely group on the first 2 characters, they want to group based on one string containing another – Chris_Rands Jul 13 '18 at 15:29
4
# assuming list is sorted:
pattern = ["ABCDE",
"ABCDEFG",
"ABCDEFGH",
"ABCDEFGHIJKLMNO",
"CEST",
"DBTSFDE",
"DBTSFDEO",
"EOEUDNBNUW",
"EAEUDNBNUW",
"FG",
"FGH"]

pattern = list(reversed(pattern))

def iterate_patterns():
    while pattern:
        i = pattern.pop()
        throw_it_away = False
        for p in pattern:
            if p.startswith(i):
                throw_it_away = True
                break
        if throw_it_away == False:
            yield i

print(list(iterate_patterns()))

Output:

['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']

Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
1
with open('demo.txt') as f:
    lines = f.readlines()

l_lines = len(lines)

n_lst = []

for i, line in enumerate(lines):
    line = line.strip()
    if i == l_lines - 1:
        if lines[-2] not in line:
            n_lst.append(line)
        break
    if line not in lines[i + 1]:
        n_lst.append(line)

print(n_lst)

Output

['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']
Druta Ruslan
  • 7,171
  • 2
  • 28
  • 38
1

This will get you where you want to be:

with open('toy.txt' ,'r') as f:
    lines = f.readlines()
    data = set(lines)
    print(sorted([i for i in lines if len([j for j in data if j.startswith(i)])==1]))

#['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EAEUDNBNUW', 'EOEUDNBNUW', 'FGH']

I've added set just in case of multiple occurrences of same text.

zipa
  • 27,316
  • 6
  • 40
  • 58
1

A simple way is to process the input file one line at a time, compare each line with the previous one and keep previous one if it is not contained in current one.

Code can be as simple as:

with open('toy.txt' ,'r') as f:
    old = next(f).strip()               # keep first line after stripping EOL 

    for pattern in f:
        pattern = pattern.strip()       # strip end of line...
        if old not in pattern:
            print old                   # keep old if it is not contained in current line
        old = pattern                   # and store current line for next iteration
    print old                           # do not forget last line
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
1

As stated in other answers, your error comes from calculating the length of your input at the start and then not updating it as you shorten the list.

Here's another take at a working solution:

with open('toy.txt', 'r') as infile:
    input_lines = reversed(map(lambda s: s.strip(), infile.readlines()))

output = []
for pattern in input_lines:
    if len(output) == 0 or not output[-1].startswith(pattern):        
        output.append(pattern)

print('\n'.join(reversed(output)))
Woodford
  • 3,746
  • 1
  • 15
  • 29
1

Not an exact match with your expectations, but, given that you state it's sorted (and it's not, near EOEUDNBNUWD EAEUDNBNUW) and that I don't know why you're missing EOEUDNBNUWD I am not sure if your expectations are correctly stated or if I've misread your question.

(ah, yes, I see the notion of overlap throws a wrench into the sort and startswith approach).

Might be nice for the OP to restate that particular aspect, I read @DSM comment without really understanding his concern. Now I do.

li = sorted([i.strip() for i in """
ABCDE
ABCDEFG
ABCDEFGH
ABCDEFGHIJKLMNO
CEST
DBTSFDE
DBTSFDEO
EOEUDNBNUW
EOEUDNBNUWD
EAEUDNBNUW
FEOEUDNBNUW
FG
FGH""".splitlines() if i.strip()])

def get_iter(li):
    prev = ""
    for i in li:
        if not i.startswith(prev):
            yield(prev)
        prev = i
    yield prev

for v in get_iter(li):
    print(v)

output:

ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EAEUDNBNUW
EOEUDNBNUWD
FEOEUDNBNUW
FGH
JL Peyret
  • 10,917
  • 2
  • 54
  • 73
1

Code

import collections as ct


def read_file(filepath):
    """Yield a generator of lines from a file."""
    with open(filepath, "r") as f:
        for line in f:
            yield line.strip()


def find_longest_sequences(seqs):
    """Return a dict of the long common sequences."""
    seqs = tuple(seqs) 
    dd = ct.defaultdict(list)
    [dd[k].append(seq) for seq in seqs for k in seqs if k in seq]
    return {max(v, key=len) for v in dd.values()}


data = read_file("test.txt")
find_longest_sequences(data)

Output

{'ABCDEFGHIJKLMNO',
 'CEST',
 'DBTSFDEO',
 'EAEUDNBNUW',
 'EOEUDNBNUWD',
 'FEOEUDNBNUW'}

Details

We use read_file to yield each line of the file.

find_longest_sequences builds a defaultdict that groups similar sequences together. It iterates the data with two loops:

  1. The first loop builds a dict of empty lists with unique sequences as keys.
  2. The second loop appends as values any strings that are similar to the key.

A set of the values is made of the resulting dict, and the longest sequences are returned.

Note some discrepancies with your expected output:

  1. FGH overlaps with ABCDEFGHIJKLMNO and is thus not a valid output.
  2. FEOEUDNBNUWD is not an original sequence. Post-processing is needed for overlapping sequences.
pylang
  • 40,867
  • 14
  • 129
  • 121
1

Kenny, You almost got it, but there are two problems which @scharette pointed out:

  1. for loop and removing of list item should not go together. The fix is to use the while loop and explicitly increase the index. The while loop is less efficient because it calls len() several times instead once, but that's what it take to get the correct result.
  2. The IndexError. This only happens at the very last line. My way to deal with this problem is to ignore the error.

With that, I modified your code to:

with open('toy.txt' ,'r') as f:
    pattern = f.read().splitlines()
    print pattern

    try:
        i = 0
        while i < len(pattern):
            if pattern[i] in pattern[i+1]:
                pattern.remove(pattern[i])
            print pattern
            i += 1
    except IndexError:
        pass
Hai Vu
  • 37,849
  • 11
  • 66
  • 93
-1

You can use a binary tree whose insertion process attempts to find nodes that precede the value:

class Tree:
  def __init__(self, val=None):
    self.left, self.value, self.right = None, val, None
  def insert_val(self, _val):
    if self.value is None or _val.startswith(self.value):
       self.value = _val
    else:
       if _val < self.value:
          getattr(self.left, 'insert_val', lambda x:setattr(self, 'left', Tree(x)))(_val)
       else:
          getattr(self.right, 'insert_val', lambda x:setattr(self, 'right', Tree(x)))(_val)
  def flatten(self):
     return [*getattr(self.left, 'flatten', lambda :[])(), self.value, *getattr(self.right, 'flatten', lambda :[])()]

t = Tree()
for i in open('filename.txt'):
  t.insert_val(i.strip('\n'))
print(t.flatten())

Output:

['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EAEUDNBNUW', 'EOEUDNBNUW', 'FGH']
Ajax1234
  • 69,937
  • 8
  • 61
  • 102