
I'm trying to take a list of 4 million entries and, rather than iterate over them all, reduce the list from within the for loop that is enumerating it as it goes along.

The reduction criterion is only discovered inside the loop: some later my_huge_list elements contain a combination of 2 consecutive elements that allows them to be discarded immediately.

Here I'm going to remove sublists containing the consecutive pairs 1,2 or A,B from my_huge_list.

Please note that I don't know in advance that 1,2 and A,B are illegal; I only find that out inside my for loop.

output_list = []

my_huge_list = [[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4],[0,1,2,3,4],['A','B'],[0,1,3,'A','B'],[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4]...] #to 4m assorted entries

for sublist in my_huge_list[:]:  #iterate over a shallow copy so the original can be reduced
    pair = None
    for item_index in range(len(sublist) - 1):  #Edit for Barmar. each item in sublist is actually an object with attributes about allowed neighbors.
        if sublist[item_index + 1] in sublist[item_index].attributes['excludes_neighbors_list']:
            pair = [sublist[item_index], sublist[item_index + 1]]  #TODO build a list of pairs

    if pair is not None:  #Don't want pair in any item of output_list
        #This list comprehension's sole function is to reduce my_huge_list from a 4m item list to 1.7m items
        my_huge_list = [x for x in my_huge_list if ','.join(pair) not in str(x)]

    #if '1, 2' in str(sublist):  #Don't want 1,2 in any item of output_list
    #    my_huge_list = [x for x in my_huge_list if '1, 2' not in str(x)]  #This list comprehension's sole function is to reduce my_huge_list

    #elif 'A, B' in str(sublist):  #Don't want A,B in any item of output_list
    #    my_huge_list = [x for x in my_huge_list if 'A, B' not in str(x)]  #This list comprehension's sole function is to reduce my_huge_list from a 1.7m item list to 1.1m items

    else:
        output_list.append(sublist)


>>> my_huge_list
[[0,1,3,4],[0,1,3,4],[0,1,3,4],[0,1,3,4]...]

So the for loop unfortunately does not seem to get any faster: it still iterates over all 4m entries, because it runs over the shallow copy, even though my_huge_list itself is quickly reduced by the list comprehension.

[The my_huge_list does not need to be processed in any order and does not need to be retained after this loop.]

[I have considered making the for loop into a sub-function and using map, and also the shallow copy, but I can't figure this architecture out.]

[I'm sure from testing that removing list elements with the list comprehension is quicker than brute-forcing all 4m sublists.]
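
For concreteness, the sub-function refactor I keep picturing looks roughly like this (find_bad_pair is just a hypothetical wrapper around the attribute test from the loop above):

def find_bad_pair(sublist):
    #Return the first disallowed neighboring pair in sublist, or None
    for i in range(len(sublist) - 1):
        if sublist[i + 1] in sublist[i].attributes['excludes_neighbors_list']:
            return [sublist[i], sublist[i + 1]]
    return None

#...but a single filter/map pass like this still visits all 4m sublists,
#and I can't see how to shrink the iterable while it is being consumed:
output_list = [s for s in my_huge_list if find_bad_pair(s) is None]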

Thanks!

Joylove
    Converting the sublists to strings and then searching them is expensive. – Barmar Feb 02 '18 at 22:31
  • 4 million is a big deal. I am not sure if this is possible in the context of your project, but have you considered working with your list in a database using SQL rather than in memory in your program? – Alexis.Rolland Feb 02 '18 at 22:31
  • @Barmar Yes, sorry, I don't actually do this str conversion; it's really a logical test on a numpy array value. I used str(x) to simplify the post. – Joylove Feb 02 '18 at 22:38
  • @Alexis.Rolland I could; how would I arrange the SQL so this routine is faster? – Joylove Feb 02 '18 at 22:41
  • @Joylove if I understood correctly you just want to filter out the sublists [1, 2] and [A, B]. Assuming you have your data loaded in a table where each sublist is stored in one record, you could simply do something like: select sublist from table where sublist not in ('[1, 2]', '[A, B]') – Alexis.Rolland Feb 02 '18 at 22:56
  • ... Why do you think iterating over your huge list *multiple times* will be faster than doing it once? – juanpa.arrivillaga Feb 02 '18 at 23:06
  • @Alexis.Rolland Yes, that would work because it's filtering my_huge_list, which is what I'm trying to achieve here with the list comprehension: a filter on my_huge_list that reduces the size of the for loop iterable in place. – Joylove Feb 02 '18 at 23:23

2 Answers


Here's my dig on it:

my_huge_list = [[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4],[0,1,2,3,4],['A','B'],[0,1,3,'A','B'],[0,'A','B'],[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4]] #to 4m assorted entries

# ... do whatever and return unwanted list... #

# ... if needed, convert the returned items into lists before putting into unwanted ... #

unwanted = [[1,2], ['A','B']]

index = 0
while index < len(my_huge_list):
    sublist = my_huge_list[index]
    advance = True  # renamed from `next` to avoid shadowing the built-in
    for u in unwanted:
        if u in [sublist[j:j+len(u)] for j in range(len(sublist)-len(u)+1)] or u == sublist:
            my_huge_list.pop(index)
            advance = False
            break  # this entry is already removed; don't pop again for another pattern
    index += advance

print(my_huge_list)

# [[0, 1, 3, 4], [0, 1, 3, 4], [0, 1, 3, 4], [0, 1, 3, 4]]

It's not elegant, but it gets the job done. A huge caveat is that modifying a list while iterating over it is bad karma (pros will probably shake their heads at me), but when dealing with 4 million entries you can understand that I'm trying to save some memory by modifying in place.

This is also scalable, so if you have multiple unwanted patterns of different lengths it should still catch them in your huge list. If a pattern is only one element long, try to match the element type expected in your my_huge_list, e.g. if your my_huge_list has a [1], your unwanted should be [1] as well. If the entry is a string instead of a list, you'll need that string in your unwanted. An int/float entry will however break this current code, as you can't iterate over it, but you can add extra handling before you iterate through unwanted, as sketched below.
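
If you do need that extra handling, a rough sketch could look something like this (purely illustrative; the mixed sample data and the isinstance check are my own assumptions about your data):

my_huge_list = [[0, 1, 2, 3, 4], 5, 'AB', [0, 1, 3, 4]]  # made-up mix of entry types
unwanted = [[1, 2], [5], 'AB']                           # keep every pattern as a list or string

index = 0
while index < len(my_huge_list):
    sublist = my_huge_list[index]
    advance = True
    for u in unwanted:
        if isinstance(sublist, (int, float)):
            # scalar entries can't be sliced, so fall back to a plain equality test
            matched = (sublist == u or [sublist] == u)
        else:
            matched = (u == sublist or
                       u in [sublist[j:j+len(u)] for j in range(len(sublist)-len(u)+1)])
        if matched:
            my_huge_list.pop(index)
            advance = False
            break
    index += advance

print(my_huge_list)
# [[0, 1, 3, 4]]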

r.ook
  • Thanks, trying a while loop now. Will report back. – Joylove Feb 03 '18 at 00:41
  • Eventually my_huge_list and output_list are the same length, so I used that comparison for code readability. This is working now. Thanks! – Joylove Feb 03 '18 at 01:59
  • About modifying a list during iteration, the for sublist in my_huge_list[:] is working on a shallow copy using the full slice, according to the documentation. https://docs.python.org/2.0/ref/for.html – Joylove Feb 03 '18 at 02:06
  • There are plenty of good reasons for not modifying the same list while iterating it, see here: https://stackoverflow.com/questions/6260089/strange-result-when-removing-item-from-a-list A shallow copy could work (basically a list comprehension), but in your use case creating a shallow copy of up to 4 million entries seems nonsensical. – r.ook Feb 03 '18 at 03:12

You're iterating over your huge list multiple times: once in the main for loop, and then, each time you find an invalid element, again in a list comprehension to remove all of those invalid elements.

It would be better to simply filter those elements out of the list once with a list comprehension.

def sublist_contains(l, pair):
    for i in range(len(l)-1):
        if l[i] == pair[0] and l[i+1] == pair[1]:
            return True
    return False

output_list = [sublist for sublist in my_huge_list if not (sublist_contains(sublist, ['A', 'B']) or sublist_contains(sublist, ['1', '2']))]

My sublist_contains() function assumes it's always just two elements in a row you have to test for. You can replace this with a more general function if necessary. See elegant find sub-list in list
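
For example, a window-based version that handles a pattern of any length might look something like this (just a sketch; the zip-based trick ShadowRanger suggests in the comments is another option):

def sublist_contains(l, pattern):
    # True if `pattern` occurs as a run of consecutive elements of `l`
    n = len(pattern)
    return any(list(l[i:i+n]) == list(pattern) for i in range(len(l) - n + 1))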

Barmar
  • I think this requires me to pass in `pair`, and to do that I need to check all the sublists of my_huge_list to discover that the pairs 1,2 and A,B are unwanted. Have I understood correctly? – Joylove Feb 02 '18 at 22:58
  • I don't understand the question. `pair` is being passed in the calls `sublist_contains(sublist, ['A', 'B']) or sublist_contains(sublist, ['1', '2'])` – Barmar Feb 02 '18 at 22:59
  • Well, the question has 2 parts. First, 1,2 and A,B are invalid, so I can't do the list comprehension into output_list until both 1,2 and A,B have been excluded, otherwise output_list could contain instances of 1,2 or A,B. Second, to actually identify 1,2 and A,B to pass as pair I need to iterate over all 4M sublists, as A,B could be found only once and in the last list index. – Joylove Feb 02 '18 at 23:18
  • The list comprehension returns all the sublists that don't have `1,2` or `A,B` in them. – Barmar Feb 02 '18 at 23:20
  • Don't you see the `if` in the list comprehension? That makes it a filter. – Barmar Feb 02 '18 at 23:21
  • @Barmar But I don't know that 1,2 and A,B are illegal until I run a test on each sublist in my_huge_list; they are not known until I inspect every item of my_huge_list. They are duplicated many times, however, once they are known. This is why I iterate over the list within the for loop: once detected (slow), I go deleting duplicates (fast). The issue I have is that the for loop that detects 1,2 and A,B is not reducing in size. – Joylove Feb 02 '18 at 23:27
  • I still don't understand. You're not inspecting every item, you're removing all of them after you detect the first one. But that's the same thing as just removing all of them. – Barmar Feb 02 '18 at 23:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164441/discussion-between-joylove-and-barmar). – Joylove Feb 02 '18 at 23:46
  • As an optimization, the whole of `sublist_contains` could be replaced with `return pair in zip(l, l[1:])` (or if `pair` isn't assumed to be a `tuple`, since you pass it as `list` in your example code, `return tuple(pair) in zip(l, l[1:])`). On Python 3 (and Python 2 if you do `from future_builtins import zip`) this will be lazy and short-circuiting, and it will do all the work at the C layer, running significantly faster. If `l[1:]` bothers you, `itertools.islice(l, 1, None)` will avoid making a temporary `list`. – ShadowRanger Feb 03 '18 at 00:00
  • @ShadowRanger I'm trying to avoid making lots of copies of the sublist, since this is so performance-sensitive. – Barmar Feb 03 '18 at 00:01
  • @Barmar: Heh, I was just editing that in as you posted. :-) `itertools.islice` to the rescue. That said, all the sublists are small, so I'd actually expect the eager slice to be faster because of how CPython tends to optimize slicing better than generalized function calls over generalized iterables. – ShadowRanger Feb 03 '18 at 00:01
  • @ShadowRanger Thanks. I'm not really familiar with all of `itertools`, but I'm hardly surprised it does this. I assume the question I linked to mentions it. – Barmar Feb 03 '18 at 00:02
  • `if not (sublist_contains(sublist, ['A', 'B']) or sublist_contains(sublist, ['1', '2']))` makes both `A,B` and `1,2` illegal. – Barmar Feb 03 '18 at 00:52
  • I guess I don't understand your criteria for illegal sequences. You say that you don't know in advance what's illegal, but I don't see how you're determining it. Your script has `A,B` hard-coded into it. – Barmar Feb 03 '18 at 00:54
  • You just decide that it's illegal when you find the first match of it, but that still requires that you know what it is in advance. – Barmar Feb 03 '18 at 00:55
  • OK, got your question. A,B and 1,2 are determined to be illegal by visiting my_huge_list[6] and my_huge_list[0] and iterating over each sublist, checking the list element objects' attributes to see whether they can be neighbors. The A,B objects cannot be neighbors in a list, and neither can 1,2. – Joylove Feb 03 '18 at 01:02
  • OK, `my_huge_list[6] == [A,B]`, so that defined one of the illegal values, but `my_huge_list[0] == [0,1,2,3,4]`; how do you get `[1,2]` from that? – Barmar Feb 03 '18 at 01:05
  • You need to clarify all of this **in the question**, not in comments here. – Barmar Feb 03 '18 at 01:06
  • I'm going to dinner, maybe you can improve your question by the time I get back. – Barmar Feb 03 '18 at 01:07
  • I changed it, please take a look. Hope your dinner was good! – Joylove Feb 03 '18 at 01:17