
I have two large lists train and keep, with the latter containing only unique elements, e.g.:

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

Is there a way to create a new list that has all the elements of train that are in keep using sets? The end result should be:

train_keep = [1, 3, 4, 3, 1]

Currently I'm using itertools.filterfalse from how to keep elements of a list based on another list but it is very slow as the lists are large...
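Roughly what that looks like (a sketch of the current approach; the slow part is the membership test against the keep *list*, which is O(len(keep)) per element of train):

```python
from itertools import filterfalse

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

# filterfalse drops elements for which the predicate is True,
# so "not in keep" keeps exactly the elements that ARE in keep
train_keep = list(filterfalse(lambda x: x not in keep, train))
print(train_keep)  # [1, 3, 4, 3, 1]
```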

Mad Physicist
ajrlewis

5 Answers


Convert the list keep into a set, since that is what will be checked frequently. Iterate over train, since you want to keep the order and repeats; that rules out turning train into a set. Even if it didn't, it wouldn't help, since the iteration would have to happen anyway:

keeps = set(keep)
train_keep = [k for k in train if k in keeps]

A lazier, and probably slower, version would be something like:

train_keep = filter(lambda x: x in keeps, train)

Neither of these options will give you a large speedup. You'd probably be better off using numpy or pandas or some other library that implements the loops in C and stores the numbers as something simpler than full-blown Python objects. Here is a sample numpy solution:

train = np.array([...])
keep = np.array([...])
train_keep = train[np.isin(train, keep)]

This is likely an O(M * N) algorithm rather than O(M) set lookup, but if checking N elements in keep is faster than a nominally O(1) lookup, you win.
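A concrete sketch with the sample data from the question; np.isin builds a boolean mask, so order and repeats are preserved just like in the list comprehension:

```python
import numpy as np

train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
keep = np.array([1, 3, 4])

# boolean mask: True where the train element appears anywhere in keep
mask = np.isin(train, keep)
train_keep = train[mask]
print(train_keep)  # [1 3 4 3 1]
```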

You can get something closer to O(M log(N)) using sorted lookup:

train = np.array([...])
keep = np.array([...])
keep.sort()

ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]

A better alternative might be to append np.inf or an out-of-bounds integer to the sorted keep array, so you don't have to distinguish missing elements from edge elements at all. Something like max(train.max(), keep.max()) + 1 would do:

train = np.array([...])
keep = np.array([..., 99999])
keep.sort()

ind = np.searchsorted(keep, train, side='left')
train_keep = train[keep[ind] == train]

For random inputs with train.size = 10000 and keep.size = 10, the numpy method is ~10x faster on my laptop.
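Putting the sentinel version together with the sample data (a sketch; 99 stands in for the out-of-bounds sentinel, being larger than anything in either array):

```python
import numpy as np

train = np.array([1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1])
# sentinel 99 > max of both arrays, so searchsorted never runs off the end
keep = np.array([1, 3, 4, 99])
keep.sort()

# index of the leftmost slot where each train element could be inserted
ind = np.searchsorted(keep, train, side='left')
# keep only the elements that actually match what they landed on
train_keep = train[keep[ind] == train]
print(train_keep)  # [1 3 4 3 1]
```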

Mad Physicist
  • I tried your clever solution using searchsorted() and it's wonderfully fast, but I had to add parentheses to get the masks to work: `mask = (ind > 0) & (ind < train.size)` and `extra = (ind == 0) & (train == keep[0])`. – John Jan 02 '20 at 04:44
  • @John. I appreciate the catch. Fixed. – Mad Physicist Jan 02 '20 at 06:10
  • I think I spoke too soon; I haven't been able to get the searchsorted() approach to work for me. The searchsorted docs say "Find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved," which isn't the same as *matching* elements. Am I missing something? Also I was hoping to use it on a character array... – John Jan 02 '20 at 22:22
  • ...or maybe it would work if `train` and `keep` were indexes? – John Jan 02 '20 at 22:30
  • @John. Thanks again for the catch. Apparently I forgot a crucial step: checking that the indexed elements are actually equal to the target values. I'll fix properly and test when I get to a desktop, but it's something like `mask = (keep[ind] == train); result = train[mask]` – Mad Physicist Jan 03 '20 at 02:31
  • @John. I've fixed my answer. Don't know how nobody noticed this before... – Mad Physicist Jan 03 '20 at 05:17
  • Awesome! Seems to work perfectly now. BTW, I am new to Numpy and my data consist of ugly strings like 'ABC_S8#Q09#2#510a#6'. I read something about Numpy truncating long character strings. Do I have to worry about the final equality test (`train[keep[ind] == train]`) failing because of truncation? Also I assume the large number (99999) could just be len(data.index) + 1, right? – John Jan 03 '20 at 16:09
  • @John. That really depends. If the strings mean something, I would recommend trying to turn them into numbers somehow. If not, you have to make sure that the data type is long enough to hold the longest string, otherwise, yes, some might get truncated. – Mad Physicist Jan 03 '20 at 16:12
  • @John. Perhaps you have a problem worth asking a question about? I'd be happy to take a look. – Mad Physicist Jan 03 '20 at 19:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/205320/discussion-between-john-and-mad-physicist). – John Jan 03 '20 at 20:26
>>> keep_set = set(keep)
>>> [val for val in train if val in keep_set]
[1, 3, 4, 3, 1]

Note that if keep is small, there might not be any performance advantage to converting it to a set (benchmark to make sure).
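A quick way to check that on your own data (a sketch using timeit; the numbers will vary by machine and by the size of keep):

```python
import timeit

train = list(range(10)) * 100_000
keep = [1, 3, 4]
keep_set = set(keep)

# time membership tests against the list vs. the set
t_list = timeit.timeit(lambda: [v for v in train if v in keep], number=10)
t_set = timeit.timeit(lambda: [v for v in train if v in keep_set], number=10)
print(f"list membership: {t_list:.3f}s, set membership: {t_set:.3f}s")
```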

NPE
  • Since OP says it's very slow, I'm guessing that both of their actual datasets are a bit larger than shown. – Mad Physicist Jul 23 '19 at 10:51
  • @alex_lewis: Then the question isn't giving us enough information to provide a good answer. An MCVE of the performance issues might be a good start: https://stackoverflow.com/help/minimal-reproducible-example – NPE Jul 23 '19 at 10:56
  • 1
    @alex_lewis. You won't get much speedup with vanilla Python – Mad Physicist Jul 23 '19 at 10:58

This is an option:

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

keep_set = set(keep)
res = [item for item in train if item in keep_set]
# [1, 3, 4, 3, 1]

I use keep_set to speed up the lookup a bit.

hiro protagonist
  • Does it though? Speed up the lookup I mean. – Mad Physicist Jul 23 '19 at 10:48
  • i think i remember an example where there was a speedup for only 2 elements (but i could not give you a pointer nor did i run any benchmark with a current python interpreter)... – hiro protagonist Jul 23 '19 at 10:51
  • One thing you need to be careful of is that the `set(keep)` gets evaluated for every iteration of the loop eg: (every item in `train`)... Hence, it's worth moving it out of the list-comp – Jon Clements Jul 23 '19 at 10:52
  • @JonClements sure, in general that would need to be done (might adapt the answer)... – hiro protagonist Jul 23 '19 at 10:53
  • 1
    Oh, even if you run it once, you're creating that set `len(train)` times each iteration of the list-comp... – Jon Clements Jul 23 '19 at 10:55
  • 1
    for instance... try... `def my_set(iterable): print('called'); return set(iterable)`... and then use `my_set` as the function call in your list-comp instead of `set`... – Jon Clements Jul 23 '19 at 10:56
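Jon Clements' point can be demonstrated with a small counter (a sketch; `counting_set` is a wrapper introduced here just for illustration). Building the set inside the comprehension's condition rebuilds it once per element of train; hoisting it out builds it once:

```python
calls = []

def counting_set(iterable):
    # wrapper that records how many times the set gets built
    calls.append(1)
    return set(iterable)

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]

# set built inside the condition: once per element of train
res1 = [x for x in train if x in counting_set(keep)]
inside = len(calls)

calls.clear()
# set built once, up front
ks = counting_set(keep)
res2 = [x for x in train if x in ks]
outside = len(calls)

print(inside, outside)  # 11 1
```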

The logic is the same, but give it a try; maybe a generator is faster in your case:

def keep_if_in(to_keep, ary):
  for element in ary:
    if element in to_keep:
      yield element

train = [1, 2, 3, 4, 5, 5, 5, 5, 3, 2, 1]
keep = [1, 3, 4]
train_keep = keep_if_in(set(keep), train)

Finally, convert to a list when required or iterate directly the generator:

print(list(train_keep))

#  alternatively, uncomment this and comment out the line above;
#  note that a generator can only be consumed once
#  for e in train_keep:
#      print(e)
iGian

This is a slight expansion of Mad Physicist's clever technique, to cover a situation where the lists contain strings and one of them is a dataframe column. (I was trying to find a list of items in a dataframe, including all duplicates, but the obvious answer, `mylist.isin(df['col'])`, removed the duplicates.) I adapted his answer to deal with the problem of possible truncation of character data by Numpy.

import numpy as np
import pandas as pd

#Sample dataframe with strings
d = {'train': ['ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510l','ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510d02','ABC_S8#Q09#2#510c#8y','ABC_S8#Q09#2#510a#6'], 'col2': [1,2,3,4,5,6]}
df = pd.DataFrame(data=d)

keep_list = ['ABC_S8#Q09#2#510a#6','ABC_S8#Q09#2#510b13','ABC_S8#Q09#2#510c#8y']

#Make sure the Numpy datatype accommodates the longest string in either list
maxlen = max(len(max(keep_list, key=len)), len(max(df['train'], key=len)))
strtype = '<U' + str(maxlen)

#Convert lists to Numpy arrays
keep = np.array(keep_list,dtype = strtype)
train = np.array(df['train'],dtype = strtype)

#Algorithm
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = df[keep[ind] == df['train']] #reference the original dataframe

I found this to be much faster than other solutions I tried.

John
  • It's irrelevant that one list is a dataframe, since you can just do `df['train'].values` – Mad Physicist Jan 03 '20 at 20:33
  • You mention that this fails for larger data sets. Can you generate a random dataset (with a fixed seed) that reproduces the failure and ask a question about that? – Mad Physicist Jan 03 '20 at 20:35