1

I have a nested list of tens of millions of lists (I can use tuples also). Each list is 2-7 items long. Each item in a list is a string of 1-5 characters and occurs no more than once per list. (I use single char items in my example below for simplicity)

#Example nestedList: 

nestedList = [
    ['a', 'e', 'O', 'I', 'g', 's'],
    ['w', 'I', 'u', 'O', 's', 'g'],
    ['e', 'z', 's', 'I', 'O', 'g']
]

I need to find which lists in my nested list contain a pair of items so I can do stuff to these lists while ignoring the rest. This needs to be as efficient as possible.

I am using the following function but it seems pretty slow and I just know there has to be a smarter way to do this.

def isBadInList(bad, checkThisList):
    numChecks = len(list) - 1
    for x in range(numChecks):
        if checkThisList[x] == bad[0] and checkThisList[x + 1] == bad[1]:
            return True
        elif checkThisList[x] == bad[1] and checkThisList[x + 1] == bad[0]:
            return True
    return False

I will do this,

bad = ['O', 'I']

for checkThisList in nestedLists:
    result = isBadInList(bad, checkThisList)
    if result:
        doStuffToList(checkThisList)

#The function isBadInList() only returns true for the first and third list in nestedList and false for all else.

I need a way to do this faster if possible. I can use tuples instead of lists, or whatever it takes.

sloganq
  • 23
  • 4
  • (1) I assume the string items aren't all one character long? (2) Do you plan to run this operation often for same value of `nestedLists` and different `bad` or vice versa or is everything different at each run? – Michael Butscher Mar 26 '19 at 01:55
  • (3) How many different string items are there roughly? – Michael Butscher Mar 26 '19 at 01:59
  • See [Checking if list is a sublist](https://stackoverflow.com/questions/35964155/checking-if-list-is-a-sublist). – martineau Mar 26 '19 at 02:12
  • Each string item is 1 - 5 characters long. Also, for the future, I'm considering switching the string items to ints, which will represent unique strings items. – sloganq Mar 26 '19 at 03:00
  • I kind of cheated to solve my problem. I found a way to represent pairs of items as integers. So now I can simply use: is intX in listY. – sloganq Mar 26 '19 at 21:08

2 Answers2

0
nestedList = [
    ['a', 'e', 'O', 'I', 'g', 's'],
    ['w', 'I', 'u', 'O', 's', 'g'],
    ['e', 'z', 's', 'I', 'O', 'g']
]

#first create a map
pairdict = dict()


for i in range(len(nestedList)):
    for j in range(len(nestedList[i])-1):
        pair1 = (nestedList[i][j],nestedList[i][j+1])
        if pair1 in pairdict:
            pairdict[pair1].append(i+1)
        else:
            pairdict[pair1] = [i+1]
        pair2 = (nestedList[i][j+1],nestedList[i][j])
        if pair2 in pairdict:
            pairdict[pair2].append(i+1)
        else:
            pairdict[pair2] = [i+1]

del nestedList

print(pairdict.get(('e','z'),None))

create a value pair and store them into map,the key is pair,value is index,and then del your list(this maybe takes too much memory), and then ,you can take advantage of the dict for look up,and print the indexes where the value appears.

goudan
  • 58
  • 4
0

I think you could use some regex here to speed this up, although it will still be a sequential operation so your best case is O(n) using this approach since you have to iterate through each list, however since we have to iterate over every sublist as well that would make it O(n^2).

import re

p = re.compile('[OI]{2}|[IO]{2}') # match only OI or IO

def is_bad(pattern, to_check): 
    for item in to_check:
        maybe_found = pattern.search(''.join(item))
        if maybe_found:
            yield True
        else:
            yield False


l = list(is_bad(p, nestedList))

print(l)
# [True, False, True]
gold_cy
  • 13,648
  • 3
  • 23
  • 45