
I am trying to iterate over a large list and want a method that does it quickly, but it takes a lot of time. Is there any way to iterate faster, or is Python simply not built for this? My code snippet is:

for i in THREE_INDEX:
    if check_balanced(rc, pc):
        print('balanced')
    else:
        rc, pc = equation_suffix(rc, pc, i) 

Here THREE_INDEX has a length of 117649, and iterating over it takes around 4-5 minutes. Is there any method to iterate it more quickly?

equation_suffix function:

def equation_suffix(rn, pn,  suffix_list):
    len_rn = len(rn)
    react_suffix = suffix_list[: len_rn]
    prod_suffix = suffix_list[len_rn:]
    for re in enumerate(rn):
        rn[re[0]] = add_suffix(re[1], react_suffix[re[0]])
    for pe in enumerate(pn):
        pn[pe[0]] = add_suffix(pe[1], prod_suffix[pe[0]])
    return rn, pn

check_balanced function:

def check_balanced(rl, pl):
    total_reactant = []
    total_product = []
    reactant_name = []
    product_name = []
    for reactant in rl:
        total_reactant.append(separate_num(separate_brackets(reactant)))
    for product in pl:
        total_product.append(separate_num(separate_brackets(product)))
    for react in total_reactant:
        for key in react:
            val = react.get(key)
            val_dict = {key: val}
            reactant_name.append(val_dict)
    for prod in total_product:
        for key in prod:
            val = prod.get(key)
            val_dict = {key: val}
            product_name.append(val_dict)

    reactant_name = flatten_dict(reactant_name)
    product_name = flatten_dict(product_name)

    for elem in enumerate(reactant_name):
        val_r = reactant_name.get(elem[1])
        val_p = product_name.get(elem[1])
        if val_r == val_p:
            if elem[0] == len(reactant_name) - 1:
                return True
        else:
            return False

programmer pro
  • Are you able to make use of `multiprocessing` on multiple cores? – sjc Mar 11 '20 at 15:33
  • Most likely the functions you call take most of the computation time, not just the iteration. – GoodDeeds Mar 11 '20 at 15:33
  • Can you try to be more specific? If you are using numpy, you probably don't need the ``for`` loop at all. You can use vectorization to compute any function on all the elements in the array at once. – Tirth Patel Mar 11 '20 at 15:33
  • Please make sure to measure to know exactly which operation takes the most time. Then find a way to optimize it – Pavel Vergeev Mar 11 '20 at 15:34
  • If the list is so big you probably won't want to print the results like that anyway; cutting the print statement will probably save you some time – shayaan Mar 11 '20 at 15:34
  • How long is "much time"? How much time are you looking to reduce it to? If you are so concerned about performance, why are you using Python? – Scott Hunter Mar 11 '20 at 15:34
  • What is in `check_balanced` and `equation_suffix`? – Axe319 Mar 11 '20 at 15:34
  • To profile what's taking most of the time you can use cProfile – shayaan Mar 11 '20 at 15:35
  • How can I use vectorization in numpy? – programmer pro Mar 11 '20 at 15:35
  • check_balanced and equation_suffix are functions defined above – programmer pro Mar 11 '20 at 15:36
  • Just apply the functions directly and use the ``np`` equivalent of math functions if you are using any. – Tirth Patel Mar 11 '20 at 15:36
  • Iteration on large lists doesn't really take that long. I can process a list of one million items in about 2 seconds, so it's likely your functions are slow, but since you don't show them it's hard to tell. Imagine if I iterate over a list calling a function that then calls a subprocess that takes 4 seconds: it's not the iterating that's slow, it's the function calls that are slow in each iteration – Chris Doyle Mar 11 '20 at 15:39
  • Re "How can i use vectorization in numpy?": See [this QA](https://stackoverflow.com/questions/11442191/parallelizing-a-numpy-vector-operation) – lucidbrot Mar 11 '20 at 15:43
  • Given the apparent data dependency imposed by the use of `rc` and `pc` from one iteration as function arguments in the next, I don't think any parallelization is possible. – chepner Mar 11 '20 at 15:43
  • You have a print statement inside the loop. You loop over other things within the loop. The list contains sublists that you also iterate over? The performance issues have nothing to do with your choice of language. – Kenny Ostrom Mar 11 '20 at 15:43
  • You have so many function calls in here that then call other functions we cannot see, so it's impossible for us to help you improve the performance. You have nested functions etc., and function/method calls will all decrease performance. There are places you could improve performance by using list comprehensions instead of loops with append, but that's about all that can be said based on your current code. – Chris Doyle Mar 11 '20 at 15:47
  • Minor performance note: You keep doing `for x in enumerate(...):`, then indexing for `x[0]` and `x[1]`. That's the wrong way to use `enumerate`; it's slower (indexing is surprisingly expensive in CPython, and failing to unpack inhibits an `enumerate` optimization to avoid allocating `tuple`s) and involves magic numbers (indexing to `0` and `1` not being as obvious as using real names). Unpack the value produced to useful names, and use them instead, and your inner loops should have much lower overhead. – ShadowRanger Mar 11 '20 at 16:28
  • You can combine with `zip` to reduce indexing even further: Example replacement: Change `for re in enumerate(rn): rn[re[0]] = add_suffix(re[1], react_suffix[re[0]])` to `for i, (r, suffix) in enumerate(zip(rn, react_suffix)): rn[i] = add_suffix(r, suffix)`. Or even faster, use a listcomp or `map` to avoid `enumerate` entirely: `rn[:] = [add_suffix(r, suffix) for r, suffix in zip(rn, react_suffix)]` or `rn[:] = map(add_suffix, rn, react_suffix)` (assigning to `rn[:]` will modify the caller's `list` just like index by index assignment, but *much* faster). – ShadowRanger Mar 11 '20 at 16:33
  • I ended up having enough micro-optimization comments that I [made an answer of it](https://stackoverflow.com/a/60641036/364696). You haven't given enough information to know what further algorithmic improvements might be possible (as I note in the answer, the use of `flatten_dict` on a `list` of `dict`s that all have one key makes me suspect they should have been built flattened from the start, saving a ton of overhead from constructing small `dict`s, and the short-circuiting `return False` also makes me suspect a missed optimization opportunity). – ShadowRanger Mar 11 '20 at 16:59

2 Answers


I believe the reason "iterating" the list takes a long time is the methods you are calling inside the for loop. I took the methods out to test the speed of the iteration alone, and it appears that iterating through a list of size 117649 is very fast. Here is my test script:

import time

start_time = time.time()
new_list = [(1, 2, 3) for i in range(117649)]
end_time = time.time()
print(f"Creating the list took: {end_time - start_time}s")

start_time = time.time()
for i in new_list:
    pass
end_time = time.time()
print(f"Iterating the list took: {end_time - start_time}s")

Output is:

Creating the list took: 0.005337953567504883s
Iterating the list took: 0.0035648345947265625s

Edit: time() returns seconds.
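As several comments suggest, profiling will show where the time actually goes. A minimal cProfile sketch (with a stand-in `slow_work` function in place of the real `check_balanced`/`equation_suffix`, which aren't reproduced here):

```python
import cProfile
import io
import pstats

def slow_work(item):
    # Hypothetical stand-in for the real per-item work
    # (check_balanced / equation_suffix in the question).
    return sum(i * i for i in range(200))

def run():
    data = [(1, 2, 3)] * 117649
    total = 0
    for item in data:          # the loop itself is cheap...
        total += slow_work(item)  # ...the calls inside it are not
    return total

profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

buf = io.StringIO()
# Sort by cumulative time to see which functions dominate.
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

The report will attribute nearly all the time to `slow_work`, not to the loop, which is the point the answer above makes.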

RonZhang724

for loops in general aren't an issue, but using them to build (or rebuild) lists is usually slower than using list comprehensions (or, in some cases, map/filter, though those are advanced tools that are often a pessimization).
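A rough timeit sketch of that claim, comparing an append loop against an equivalent list comprehension (exact ratios vary by Python version and workload):

```python
import timeit

N = 100_000

def with_append():
    # Build the list one append at a time.
    out = []
    for i in range(N):
        out.append(i * 2)
    return out

def with_listcomp():
    # Same result, built in a single list comprehension.
    return [i * 2 for i in range(N)]

print("append loop:", timeit.timeit(with_append, number=50))
print("listcomp:   ", timeit.timeit(with_listcomp, number=50))
```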

Your functions could be made significantly simpler this way, and they'd get faster to boot. Example rewrites:

def equation_suffix(rn, pn, suffix_list):
    prod_suffix = suffix_list[len(rn):]
    # Change `rn = ` to `rn[:] = ` if you must modify the caller's list in place,
    # as your original code does, rather than just return the new list
    rn = [add_suffix(r, suffix) for r, suffix in zip(rn, suffix_list)]  # No need to slice suffix_list; zip'll stop when rn exhausted
    pn = [add_suffix(p, suffix) for p, suffix in zip(pn, prod_suffix)]
    return rn, pn

def check_balanced(rl, pl):
    # These can be generator expressions, since they're iterated once and thrown away anyway
    total_reactant = (separate_num(separate_brackets(reactant)) for reactant in rl)
    total_product = (separate_num(separate_brackets(product)) for product in pl)
    reactant_name = []
    product_name = []
    # Use .items() to avoid repeated lookups, and concat simple listcomps to reduce calls to append
    for react in total_reactant:
        reactant_name += [{key: val} for key, val in react.items()]
    for prod in total_product:
        product_name += [{key: val} for key, val in prod.items()]

    # These calls are suspicious, and may indicate optimizations to be had on prior lines
    reactant_name = flatten_dict(reactant_name)
    product_name = flatten_dict(product_name)

    for i, (elem, val_r) in enumerate(reactant_name.items()):
        if val_r == product_name.get(elem):
            if i == len(reactant_name) - 1:
                return True
        else:
            # I'm a little suspicious of returning False the first time a single
            # key's value doesn't match. Either it's wrong, or it indicates an
            # opportunity to write short-circuiting code that doesn't have
            # to fully construct reactant_name and product_name when much of the time
            # there will be an early mismatch
            return False

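A quick sanity check that the listcomp rewrite of `equation_suffix` matches the original loop version, using a stub `add_suffix` (the real one isn't shown in the question, so this stub is a hypothetical stand-in):

```python
def add_suffix(name, suffix):
    # Hypothetical stand-in for the question's add_suffix.
    return f"{name}{suffix}"

def equation_suffix_old(rn, pn, suffix_list):
    # Original loop-and-index version from the question.
    len_rn = len(rn)
    react_suffix = suffix_list[:len_rn]
    prod_suffix = suffix_list[len_rn:]
    for re in enumerate(rn):
        rn[re[0]] = add_suffix(re[1], react_suffix[re[0]])
    for pe in enumerate(pn):
        pn[pe[0]] = add_suffix(pe[1], prod_suffix[pe[0]])
    return rn, pn

def equation_suffix_new(rn, pn, suffix_list):
    # Listcomp rewrite; zip stops when rn is exhausted, so no slice needed.
    prod_suffix = suffix_list[len(rn):]
    rn = [add_suffix(r, suffix) for r, suffix in zip(rn, suffix_list)]
    pn = [add_suffix(p, suffix) for p, suffix in zip(pn, prod_suffix)]
    return rn, pn

old = equation_suffix_old(["H2", "O2"], ["H2O"], [2, 1, 2])
new = equation_suffix_new(["H2", "O2"], ["H2O"], [2, 1, 2])
print(old == new)  # True: both give (['H22', 'O21'], ['H2O2'])
```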
I'll also note that using enumerate without unpacking the result gives worse performance and more cryptic code. In this case (and many others), enumerate isn't needed at all, as listcomps and genexprs can accomplish the same result without knowing the index. When it is needed, always unpack: for i, elem in enumerate(...): and then using i and elem separately will always run faster than for packed in enumerate(...): with packed[0] and packed[1] (and if you have more useful names than i and elem, it'll be much more readable to boot).
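A small timeit sketch of that last point, contrasting indexing into enumerate's tuples with unpacking them (the speed gap varies by CPython version, so no exact numbers are claimed):

```python
import timeit

data = list(range(100_000))

def indexed():
    # Indexing into the (index, value) tuple on every iteration.
    out = [0] * len(data)
    for pair in enumerate(data):
        out[pair[0]] = pair[1] * 2
    return out

def unpacked():
    # Unpacking once per iteration; cheaper and clearer.
    out = [0] * len(data)
    for i, x in enumerate(data):
        out[i] = x * 2
    return out

print("indexed: ", timeit.timeit(indexed, number=20))
print("unpacked:", timeit.timeit(unpacked, number=20))
```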

ShadowRanger