7

I want to remove an element from list, such that the element contains 'X' or 'N'. I have to apply for a large genome. Here is an example:

input:

codon=['AAT','XAC','ANT','TTA']

expected output:

codon=['AAT','TTA']  
SilentGhost
  • 307,395
  • 66
  • 306
  • 293
bharathy
  • 113
  • 1
  • 3
  • 2
    When you looked in the Python reference pages for "remove" from "list", what did you find? Anything? Please check the Python docs, and then update your question with something specific. – S.Lott Dec 08 '09 at 11:29
  • Hint: Start your search here: http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-buffer-xrange – S.Lott Dec 08 '09 at 11:30
  • I have edited the question title, it seems some people just saw "Removing an element from a list" and did not read further – Ben James Dec 08 '09 at 11:33

10 Answers10

7

For basis purpose

>>> [x for x in ['AAT','XAC','ANT','TTA'] if "X" not in x and "N" not in x]
['AAT', 'TTA']

But if you have huge amount of data, I suggest you to use dict or set

And If you have many characters other than X and N, you may do like this

>>> [x for x in ['AAT','XAC','ANT','TTA'] if not any(ch for ch in list(x) if ch in ["X","N","Y","Z","K","J"])]
['AAT', 'TTA']

NOTE: list(x) can be just x, and ["X","N","Y","Z","K","J"] can be just "XNYZKJ", and refer gnibbler answer, He did the best one.

YOU
  • 120,166
  • 34
  • 186
  • 219
4

Another not fastest way but I think it reads nicely

>>> [x for x in ['AAT','XAC','ANT','TTA'] if not any(y in x for y in "XN")]
['AAT', 'TTA']

>>> [x for x in ['AAT','XAC','ANT','TTA'] if not set("XN")&set(x)]
['AAT', 'TTA']

This way will be faster for long codons (assuming there is some repetition)

codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
    if s not in memo:
        memo[s]=not any(y in s for y in "XN")
    return memo[s]

print filter(pred,codon)

Here is the method suggested by James Brooks, you'd have to test to see which is faster for your data

codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
    if s not in memo:
        memo[s]= not set("XN")&set(s)
    return memo[s]

print filter(pred,codon)

For this sample codon, the version using sets is about 10% slower

John La Rooy
  • 295,403
  • 53
  • 369
  • 502
  • You always have shortest one, :-) +1 – YOU Dec 08 '09 at 11:47
  • Nice ! The code for `pred` could make use of the built-in dict.setdefault method (as examplified in my answer); I would guess that the setdefault version is faster. – Eric O. Lebigot Dec 08 '09 at 13:59
  • @EOL, sure it's worth trying, but remember that the second param of setdefault is evaluated everytime regardless. – John La Rooy Dec 08 '09 at 20:15
  • It's purely a nitpick at this point, but you can remove the redundant `return` statements by changing the test to `if not s is memo` and moving `memo[s]=` into its block. – Ben Blank Dec 08 '09 at 23:14
  • @Ben, Thanks, I've changed it as it actually seems to be a tiny bit faster – John La Rooy Dec 09 '09 at 00:05
3

There is also the method of doing it using filter

    lst = filter(lambda x: 'X' not in x and 'N' not in x, list)
Yacoby
  • 54,544
  • 15
  • 116
  • 120
2
  1. filter(lambda x: 'N' not in x or 'X' not in x, your_list)
  2. your_list = [x for x in your_list if 'N' not in x or 'X' not in x]
Vasiliy Stavenko
  • 1,174
  • 1
  • 12
  • 29
  • The condition here is wrong. It should either be `'N' not in x and 'X' not in x`, OR `not ('N' in x or 'X' in x)`. – Lee D Apr 19 '12 at 01:34
1

Any reason for duplicating the entire list? How about:

>>> def pred(item, haystack="XN"):
...     return any(needle in item for needle in haystack)
...
>>> lst = ['AAT', 'XAC', 'ANT', 'TTA']
>>> idx = 0
>>> while idx < len(lst):
...     if pred(lst[idx]):
...         del lst[idx]
...     else:
...         idx = idx + 1
...
>>> lst
['AAT', 'TTA']

I know that list comprehensions are all the rage these days, but if the list is long we don't want to duplicate it without any reason right? You can take this to the next step and create a nice utility function:

>>> def remove_if(coll, predicate):
...     idx = len(coll) - 1
...     while idx >= 0:
...         if predicate(coll[idx]):
...             del coll[idx]
...         idx = idx - 1
...     return coll
...
>>> lst = ['AAT', 'XAC', 'ANT', 'TTA']
>>> remove_if(lst, pred)
['AAT', 'TTA']
>>> lst
['AAT', 'TTA']
D.Shawley
  • 58,213
  • 10
  • 98
  • 113
  • This is the way a C developer would think about the problem. By using Python, in particular lists, you're already sacrificing a lot of efficiency so why would you care about modifying the list in-place? – Ben James Dec 08 '09 at 14:05
  • Actually, a C++ developer... I think that modifying it in-place is what the OP asked for. I usually care about efficiency even if the data set is large as well since large memory duplication usually means hours instead of minutes or running out of memory altogether if you are dealing with large data sets. Not to mention that list comprehensions and a slew of `not` and `and` together hide what you are really doing. Instead of _removing a few items from a list_ you see _duplicating a list without items_. I guess I prefer simplicity and clarity. – D.Shawley Dec 08 '09 at 14:52
  • 1
    Should be `idx = idx - 1` in the second example. The reason for using list comprehensions or filter() is that the loop is in C, so usually much faster. – John La Rooy Dec 08 '09 at 20:30
  • Thanks gnibbler. I never thought about the fact that comprehensions and built-ins will loop in native code instead. Interesting thought. – D.Shawley Dec 09 '09 at 00:30
  • Actually, this approach is `O(n^2)`, while getting a new filtered list and replacing the content of the old list with the new would be `O(n)`. – ead Sep 09 '19 at 19:34
1

I like gnibbler’s memoization approach a lot. Either method using memoization should be identically fast in the big picture on large data sets, as the memo dictionary should quickly be filled and the actual test should be rarely performed. With this in mind, we should be able to improve the performance even more for large data sets. (This comes at some cost for very small ones, but who cares about those?) The following code only has to look up an item in the memo dict once when it is present, instead of twice (once to determine membership, another to extract the value).

codon = ['AAT', 'XAC', 'ANT', 'TTA']
def pred(s,memo={}):
    try:
        return memo[s]
    except KeyError:
        memo[s] = not any(y in s for y in "XN")
    return memo[s]

filtered = filter(pred, codon)

As I said, this should be noticeably faster when the genome is large (or at least not extremely small).

If you don’t want to duplicate the list, but just iterate over the filtered list, do something like:

for item in (item for item in codon if pred):
    do_something(item)
musicinmybrain
  • 621
  • 4
  • 8
  • The try… except structure that you use is quite equivalent to the built-in dict.setdefault (see my answer for an example). – Eric O. Lebigot Dec 08 '09 at 14:05
  • There’s actually a subtle difference: with setdefault the “default” expression is evaluated no matter what! This expression contains the slow predicate we are attempting to avoid, and makes memoization useless. The try/except version only evaluates the predicate when there is a KeyError. – musicinmybrain Dec 09 '09 at 03:28
1

If you're dealing with extremely large lists, you want to use methods that don't involve traversing the entire list any more than you absolutely need to.

Your best bet is likely to be creating a filter function, and using itertools.ifilter, e.g.:

new_seq = itertools.ifilter(lambda x: 'X' in x or 'N' in x, seq)

This defers actually testing every element in the list until you actually iterate over it. Note that you can filter a filtered sequence just as you can the original sequence:

new_seq1 = itertools.ifilter(some_other_predicate, new_seq)

Edit:

Also, a little testing shows that memoizing found entries in a set is likely to provide enough of an improvement to be worth doing, and using a regular expression is probably not the way to go:

seq = ['AAT','XAC','ANT','TTA']
>>> p = re.compile('[X|N]')
>>> timeit.timeit('[x for x in seq if not p.search(x)]', 'from __main__ import p, seq')
3.4722548536196314
>>> timeit.timeit('[x for x in seq if "X" not in x and "N" not in x]', 'from __main__ import seq')
1.0560532134670666
>>> s = set(('XAC', 'ANT'))
>>> timeit.timeit('[x for x in seq if x not in s]', 'from __main__ import s, seq')
0.87923730529996647
Robert Rossney
  • 94,622
  • 24
  • 146
  • 218
  • however new_seq is a generator which is fine if you just want to loop over it, but sometimes you will need a list instead. – John La Rooy Dec 08 '09 at 21:01
  • If you're working with extremely large lists, anywhere that you're using lists instead of iterables is likely to be a trouble spot. You can trivially convert a generator to a list - `list(new_seq)` does that - but it does that by visiting every element in the list. If you're working with a list of a couple million codons, you really only want to do that once. – Robert Rossney Dec 08 '09 at 22:32
0

As S.Mark requested here is my version. It's probably slower but does make it easier to change what gets removed.

def filter_genome(genome, killlist = set("X N".split()):
    return [codon for codon in genome if 0 == len(set(codon) | killlist)]
James Brooks
  • 4,135
  • 4
  • 26
  • 25
0

It is (asympotically) faster to use a regular expression than searching many times in the same string for a certain character: in fact, with a regular expression the sequences is only be read at most once (instead of twice when the letters are not found, in gnibbler's original answer, for instance). With gnibbler's memoization, the regular expression approach reads:

import re
remove = re.compile('[XN]').search

codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
    if s not in memo:
        memo[s]= not remove(s)
    return memo[s]

print filter(pred,codon)

This should be (asymptotically) faster than using the "in s" or the "set" checks (i.e., the code above should be faster for long enough strings s).

I originally thought that gnibbler's answer could be written in a faster and more compact way with dict.setdefault():

codon = ['AAT','XAC','ANT','TTA']
def pred(s,memo={}):
    return memo.setdefault(s, not any(y in s for y in "XN"))

print filter(pred,codon)

However, as gnibbler noted, the value in setdefault is always evaluated (even though, in principle, it could be evaluated only when the dictionary key is not found).

Eric O. Lebigot
  • 91,433
  • 48
  • 218
  • 260
0

If you want to modify the actual list instead of creating a new one here is a simple set of functions that you can use:

from typing import TypeVar, Callable, List
T = TypeVar("T")

def list_remove_first(lst: List[T], accept: Callable[[T], bool]) -> None:
    for i, v in enumerate(lst):
        if accept(v):
            del lst[i]
            return


def list_remove_all(lst: List[T], accept: Callable[[T], bool]) -> None:
    for i in reversed(range(len(lst))):
        if accept(lst[i]):
            del lst[i]
    
bennyl
  • 2,886
  • 2
  • 29
  • 43