12

I have a list with a length of 370,000. In this list I have items like "a", "y", "Y", "q", "Q", "p", "P", meaning this is a list of words, but from time to time I get those single characters.

I want to remove those characters from the list. I am pretty new to Python, but the first thing that came to my mind was to do something like:

for word in words:
    if word == 'm' or word == 'y' or word == 'Y' or word == 'p' or word == 'Q' or word == 'q' or word == 'a' or word == 'uh':
        words.remove(word)

In a list with 370,000 items this method takes a loooot of time. Seriously, a lot.

Does anyone have another awesome idea on how to get a better performance?

Thanks in advance.

NachoMiguel

6 Answers

10

Tried some bogo-benchmarks in IPython.

import random
# Don't know how to generate your words, use integers as substitute.
words = [random.randint(0, 25) for i in xrange(370000)]
badlist = range(7)
badtuple = tuple(badlist)
badset = set(badlist)
# List comprehension
%timeit [w for w in words if w not in badlist]
10 loops, best of 3: 59.2 ms per loop
%timeit [w for w in words if w not in badtuple]
10 loops, best of 3: 64.7 ms per loop
%timeit [w for w in words if w not in badset]
10 loops, best of 3: 30.3 ms per loop
# Filter
%timeit filter(lambda w: w not in badlist, words)
10 loops, best of 3: 85.6 ms per loop
%timeit filter(lambda w: w not in badtuple, words)
10 loops, best of 3: 92.1 ms per loop
%timeit filter(lambda w: w not in badset, words)
10 loops, best of 3: 50.8 ms per loop

Conclusion: using a list comprehension with not in <set> as the filter condition will probably be best.

But as I've said, this benchmark is bogus, and you need to repeat some benchmarks on the actual kind of data you'll encounter to see which is better.


Some ideas on why list comprehension + "not in set" is probably optimal:

  1. filter vs. list comprehension: filter actually calls the input callable, and calling a callable in Python has its own overhead (creating a stack frame, etc.). filter also tries to be smart and return the correct type, which adds to the overhead (though this part is infinitesimally small). A list comprehension's condition check (the if ... clause), by contrast, has less overhead than a call: it's just expression evaluation, without the full bells and whistles of the Python call stack.
  2. A set membership test is O(1) in the average case and O(n) in the worst case, while list/tuple membership is always O(n). (A string-based variant of the benchmark is sketched after this list.)
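
For reference, here is a minimal string-based variant of the benchmark, with random one-character strings as a stand-in for the OP's data (which we still don't have); rerun it on your real corpus before trusting any numbers:

import random
import string
import timeit

# stand-in data: 370000 random single characters, not the OP's real words
words = [random.choice(string.ascii_letters) for _ in range(370000)]
bad_list = ["a", "y", "Y", "q", "Q", "p", "P", "uh"]
bad_set = set(bad_list)

# time 10 passes of each filtering strategy
print(timeit.timeit(lambda: [w for w in words if w not in bad_list], number=10))
print(timeit.timeit(lambda: [w for w in words if w not in bad_set], number=10))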
Cong Ma
  • import string and [random.choice(string.ascii_uppercase + string.ascii_lowercase) for _ in range(37000)] – midori Oct 03 '15 at 00:25
  • @BigOldTree I mean, I've got no idea if that's really how the OP's data looks, so I gave up guessing and just used some numbers, which is quick and dirty ;) – Cong Ma Oct 03 '15 at 00:27
  • I understand; I got different results with chars, so the fastest one is the lambda with tuple – midori Oct 03 '15 at 00:29
  • all that aside, this is the best answer to point OP in the right direction (i.e. *test* the different ways) – Paul Evans Oct 03 '15 at 00:35
  • Can't reproduce the result of @BigOldTree, using random one-character strings with testcase size either 370000 or 37000. But the benchmark is mostly bogus anyway. – Cong Ma Oct 03 '15 at 00:36
  • I was preparing the same answer but for chars; Cong Ma posted this first, so I just added a comment – midori Oct 03 '15 at 00:36
  • what does *Also filter tries to be smart and return the correct type,* mean? – Padraic Cunningham Oct 03 '15 at 01:02
  • @PadraicCunningham upon further thought it's irrelevant here, for the type-checking by `filter()` is just a constant small overhead. Python 3.x eliminated this by making `filter()` always return an iterator. – Cong Ma Oct 03 '15 at 10:02
  • @CongMa, python does not do any type checking – Padraic Cunningham Oct 03 '15 at 10:57
  • @PadraicCunningham According to the [source](https://github.com/python/cpython/blob/2.7/Python/bltinmodule.c#L253), it does so in the implementation of `filter()`, but this is just a constant overhead. – Cong Ma Oct 03 '15 at 11:06
  • That is not really type checking; also, filter is often faster than a list comp for larger data as it is done at the C level. What is causing the code to run slower is the lambda – Padraic Cunningham Oct 03 '15 at 11:08
  • I gather that `filter()` should be faster on `str`, `unicode` and possibly `tuple` because the returned object is created directly. OTOH, doing the same thing as `filter`ing a `str` using listcomp followed by `"".join` would be wasteful. The lambda is an unfortunate necessity of `filter` but can be avoided entirely in this case by listcomp. When the filtering condition requires complex logic and a filter function is necessary even in listcomp, I expect `filter` on list to have roughly the same performance as listcomp. – Cong Ma Oct 03 '15 at 11:20
  • http://stackoverflow.com/a/31640292/2141635 – Padraic Cunningham Oct 03 '15 at 11:31
  • @PadraicCunningham I don't really see a difference in performance between `In [19, 20, and 21]`. In any case, this doesn't contradict what I said and is orthogonal to it. As you can see from the [source](https://github.com/python/cpython/blob/2.7/Python/bltinmodule.c#L303), evaluating `filter(None, ...)` avoids the call to `PyObject_Call` entirely, and that's why it _can_ be fast in this case. But there are cases when the lambda in `filter` can't be avoided (meanwhile can be avoided by listcomp syntax), and this happens to be the OP's case. – Cong Ma Oct 03 '15 at 11:47
  • @CongMa, can you not see that using filter is faster than a list comp? I also already said that the lambda is the costly operation; talking about type-checking with filter just confuses the issue, since the overhead comes from the lambda. You would have been better off explaining the actual complexity of the solutions. – Padraic Cunningham Oct 03 '15 at 11:57
  • A set approach is going to be linear unless you are incredibly unlucky and have lots of values hashing to the same hash value; all the other solutions not using a set have worse complexity, so there is no ambiguity: for large n, O(n) is going to be better than O(k^2) or O(n*k) – Padraic Cunningham Oct 03 '15 at 12:03
  • This thread is getting looong... As per @PadraicCunningham, certain comments are stricken out. Also, I don't really see a difference among Input cells 19 ~ 21 in that post. Basically the Python callable call dominates. – Cong Ma Oct 04 '15 at 14:11
3

You can use list comprehension, something like:

words = [word for word in words if word not in ["a", "y", "Y", "q", "Q", "p", "P", "uh"]]

A list comprehension tends to give much better performance than repeatedly calling remove.

Edit (thanks to results from Cong Ma):

It seems the best performance comes from using a set as the filter sequence, so you want something more like:

words = [word for word in words if word not in set(("a", "y", "Y", "q", "Q", "p", "P", "uh"))]
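
Note that set((...)) written inside the condition gets rebuilt for every word, which throws away the set's speed advantage; a small sketch that builds it once, outside the comprehension:

bad_words = {"a", "y", "Y", "q", "Q", "p", "P", "uh"}  # built once, up front

words = [word for word in words if word not in bad_words]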
Paul Evans
1

"but from time to time i get those single characters."

I think the logic is poor in here. You should eliminate it when inserted word into the list. Removing it after a lengthy List is a bad choice after all.

I've face into the same problem, at first my solution is using pypy

I think there's a problem in pypy at that time (my code suddenly exited), so i change the code with a better logic, and use ordinary python.
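
Here is a sketch of filtering at insertion time, with a small hard-coded sample standing in for whatever source actually feeds the list:

bad_set = {"a", "y", "Y", "q", "Q", "p", "P", "uh"}

source = ["hello", "y", "world", "Q", "uh", "python"]  # stand-in for the real source
words = [word for word in source if word not in bad_set]
print(words)  # ['hello', 'world', 'python']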

nafsaka
1

Try a generator pipeline; this one is a simple application of the idea. Generators have good performance and can often result in less memory usage, since the pipelines don't create huge temporary lists (although my final list comprehension violates this principle).

bad_list = ["a", "y", "Y", "q", "Q", "p", "P", "uh"]

# here is the generator "comprehension"; note that lines read from the
# file keep their trailing newline, so strip before the membership test
good_word_stream = (w for w in (line.strip() for line in open("lexicon"))
                    if w not in bad_list)

# ... and use the output for something
print [len(word) for word in good_word_stream]
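
Because every stage is lazy, you can bolt on more stages without materializing intermediate lists. A sketch (the "lexicon" file name is carried over from above; switching bad_list to a set is an assumed tweak, per the benchmark answer):

bad_set = {"a", "y", "Y", "q", "Q", "p", "P", "uh"}

stripped = (line.strip() for line in open("lexicon"))
good_words = (w for w in stripped if w not in bad_set)
lengths = (len(w) for w in good_words)

# nothing is read from the file until this line consumes the pipeline
print(sum(lengths))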
Prune
1

Modifying a list on the fly is not a great idea when you have enough memory for a copy; it is easy to get things wrong, as the comments said.

As for performance, list.remove is an O(n) operation, so your code is O(n^2).

A list comprehension is much faster since it trades space for time: it creates a new list (or a lazy stream, if you use a generator expression) and uses a small blacklist to filter the final results. Though I'm not sure whether it will create ["a", "y", "Y", "q", "Q", "p", "P", "uh"] every time, Cong Ma's temporarily deleted answer mentioned creating this small set first (yes, a set: membership testing against a set is an O(1) operation on average!), which might help performance.

Also, in a previous test of mine, a list comprehension was about 25% slower than map (or list(map(...)) in Python 3); I can't prove it right now, but you might want to run the test yourself, as sketched below.
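
A minimal harness for that map-versus-comprehension test (a sketch with made-up data; results vary across machines and Python versions):

import timeit

setup = "words = ['hello', 'world', 'uh'] * 123000"

# list comprehension vs. map for a simple transformation
print(timeit.timeit("[w.upper() for w in words]", setup=setup, number=10))
print(timeit.timeit("list(map(str.upper, words))", setup=setup, number=10))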

PyPy/Cython would be the final resort if you had done everything you could in pure Python and performance still didn't meet production requirements.

SnoopyGuo
  • I temporarily removed my answer to prevent edit. I added a few benchmarks (albeit bogus, as I used some numbers instead of strings, since we don't have the OP's data). – Cong Ma Oct 03 '15 at 00:12
  • @CongMa would you mind helping me test `map` versus list comprehension? I'm away from desktop/laptop currently.. – SnoopyGuo Oct 04 '15 at 08:37
-1
str.translate seems about 1.5x faster than a list comprehension, tested over 10,000 runs. (Note that these functions strip unwanted characters out of a single string, rather than removing single-character items from a word list.)

def remove_chars(string_, word_):
    # 10000 runs: 0.112017 s
    # map each unwanted character (plus its uppercase form) to None,
    # then delete them all in one translate() pass (Python 3 str.translate)
    string_ += string_.upper()
    vowels_table = dict.fromkeys(map(ord, string_))
    return word_.translate(vowels_table)


def remove_chars2(string_, word_):
    # 10000 runs: 0.166002 s
    # note: returns a list of characters, not a string
    return [c for c in word_ if c not in string_]
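
A quick usage example (Python 3, where str.translate accepts a mapping; note the two functions don't return the same type, so the timings aren't strictly comparable):

print(remove_chars("ay", "yaypath"))    # -> 'pth' (a string)
print(remove_chars2("ay", "yaypath"))   # -> ['p', 't', 'h'] (a list of characters)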
LetzerWille