-1

It seems there are quite a few questions and answers related to the speed of nested for loops - I think I looked at every single one of them! But unfortunately I am still not exactly sure why my code is slow. I'm hoping I can get some guidance from you fine people.

I download a csv file daily that has ~116,000 entries. Items are added and taken away from it at inconsistent points in the file, so every day I want to see what was added and what was taken away.

Getting the entries from csv to a list takes no time at all, for both the old and new list, but I encounter a big speed decrease in the next part of the code, although at the end, it does what I want and spits out the difference - items added and items removed.

Each of the 116,000 items in the list is a dictionary like so:

old or new = [{'Date Stamped': '', 'Name': '', 'Registration Number': '', 'Type': '', "Form Name':  '', 'URL': "}]

when I get to this point:

added = [i for i in new if not i in old]
removed = [i for i in old if not i in new]

It takes 25 minutes to finish! I feel like this is a long time, but also I may not be understanding exactly what I'm doing.

Each list (old and new) has ~116000 items in it. Is that because i has to iterate through ~116,000 items 4 times?

It does what I want, in the end, but it seems awfully slow for what it's doing; that said, this is really the first time I've worked with a data set with this many items, so maybe it's par for course.

Is this slow because it is a nested for loop? Is it slow because of the size? I am definitely an amateur and really appreciate everyone's help. Thanks so much.

martineau
  • 119,623
  • 25
  • 170
  • 301
Iris D
  • 82
  • 1
  • 7
  • what is your question? – deadshot Sep 01 '20 at 17:31
  • Use a `set` on the dict? – Joshua Nixon Sep 01 '20 at 17:31
  • 1
    `old or new = [{}]`? – deadshot Sep 01 '20 at 17:32
  • 4
    `in` will be incredibly slow for a list of 100,000+ items. If you need to do repeated membership testing, you should use a set (although that requires hashable objects, which dicts are not). – Carcigenicate Sep 01 '20 at 17:32
  • If you're asking about the performance of list comprehensions, that was addressed [here](https://stackoverflow.com/questions/22108488/are-list-comprehensions-and-functional-functions-faster-than-for-loops) – S.Chauhan Sep 01 '20 at 17:33
  • 2
    It's slow because in `[x for x in a if x not in b]` the code has to iterate through `b` for every `x` to check if it's present or not. It's better to make `b` a set if you can, because the cost of lookup is not proportionate to the size of a set. Better still make both `a` and `b` sets and subtract. – snakecharmerb Sep 01 '20 at 17:33
  • @deadshot my question was at the end of the post: "Is this slow because it is a nested for loop? Is it slow because of the size? ". – Iris D Sep 01 '20 at 19:28
  • @snakecharmerb thank you, this is a helpful answer, much appreciated. – Iris D Sep 01 '20 at 19:28

1 Answers1

3

Effectively, yes, it's slow because it's a nested for loop, because of the size.

Python's element in list operation works by just searching the entire list, element by element, for the one it wants. If you have to do that for every single element in new, that means you're possibly searching the entire old for each element in new.

Lists are not a good datastructure for searching through. What you should be doing instead, if you have a use case like this, is to transform them into a set first - an unordered collection (but order probably doesn't matter) which uses a hashtable to determine whether elements are present in it. Now, instead of searching the entire datastructure element-by-element, it can just hash the element being searched for, check if there's an element there, and say so if so.

In other words, element in set is an order of magnitude more efficient than element in list. For a relatively small overhead cost (in creating the sets in the first place), this shaves a huge amount of time off of the for loops that will follow:

old_set = set(old)
new_set = set(new)
added = [i for i in new if not i in old_set]
removed = [i for i in old if not i in new]

Furthermore, you can even dispense with the list comprehension, because set supports operations from set theory - taking the difference between two sets (elemenents in one set that are not in the other) is as easy as subtracting them:

added = list(new_set - old_set)  # (new_set - old_set) is identical to new_set.difference(old_set)
removed = list(old_set - new_set)

which is probably even more efficient than a list comprehension, because it's optimized for exactly this use case.

Green Cloak Guy
  • 23,793
  • 4
  • 33
  • 53
  • I think it's more than "an order of magnitude". Doesn't that mean something like 10 times faster? – superb rain Sep 01 '20 at 17:44
  • 1
    @superbrain Its complexity loses an `n` (goes from `O(n)` to `O(1)`), which is what I was trying to allude to in this context. – Green Cloak Guy Sep 01 '20 at 17:46
  • @GreenCloakGuy this was extraordinarily helpful. I haven't had much experience with sets, and while I saw info about them in the course of my research for this project, this answer also really helped me understand more about why it is slow and what my options are. Much appreciated. – Iris D Sep 01 '20 at 19:27
  • @GreenCloakGuy whoa! this was very fast! Because old and new were lists of dictionaries, I finally figured out how to get the values in the set via: old_set = set() for i in old: old_set.update(i.items()) (on both old_set and new_set) and then running the set operations that you mentioned in your note. So amazingly fast! I would like a little more info from output - is it possible to preserve the k:v pairs when converting from dict to set so that the output is more comprehensive? Many thanks again. – Iris D Sep 01 '20 at 22:24
  • @IrisD A `dict` is similar to a `set`, structurally - internally, they use the same data structure: a hashtable. If your data is already in the form of a `dict` (rather than a `list`), then you can use a similar dict comprehension `added = {k:v for k,v in new_dict.items() if k in old_dict}`. If your data is one-to-one key-value pairs, you could also reverse the dict `rev_dict = {v:k for k,v in old_dict.items()}`. Though, keep in mind that `dict`s don't support the arithmetic operations that `set`s do, so you'll want to look at the documentation to make sure you're doing the right thing. – Green Cloak Guy Sep 02 '20 at 03:44