The simplest approach is totally fine here, and likely very performant:
>>> selected_labels, selected_data = [], []
>>> for l, d in zip(labels, data):
...     if l == 'fish':
...         selected_labels.append(l)
...         selected_data.append(d)
...
>>> selected_labels
['fish', 'fish']
>>> selected_data
[0.3, 0.2]
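For anyone who wants to run this outside the REPL, here is a minimal standalone sketch of the same loop. The contents of labels and data below are made up for illustration (the real lists come from the question); they are only chosen so that the result matches the output shown above.

```python
# Hypothetical sample inputs: assumed for illustration, not taken from the question.
labels = ['cat', 'fish', 'dog', 'fish']
data = [0.1, 0.3, 0.5, 0.2]

selected_labels, selected_data = [], []
# Walk both lists in lockstep and keep only the pairs whose label matches.
for label, value in zip(labels, data):
    if label == 'fish':
        selected_labels.append(label)
        selected_data.append(value)

print(selected_labels)  # ['fish', 'fish']
print(selected_data)    # [0.3, 0.2]
```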
Here are some timings. I didn't have time to include every approach posted so far, but here are a few:
>>> labels *= 5000
>>> data *= 5000
>>> def juan(data, labels, target):
...     selected_labels, selected_data = [], []
...     for l, d in zip(labels, data):
...         if l == target:
...             selected_labels.append(l)
...             selected_data.append(d)
...     return selected_labels, selected_data
...
>>> def stephen_rauch(data, labels, target):
...     tuples = (x for x in zip(labels, data) if x[0] == target)
...     selected_labels, selected_data = map(list, zip(*tuples))
...     return selected_labels, selected_data
...
>>> from itertools import compress
>>>
>>> def brad_solomon(data, labels, target):
...     selected_data = list(compress(data, (i == target for i in labels)))
...     # Every kept label equals target, so repeat it instead of collecting labels.
...     selected_labels = [target] * len(selected_data)
...     return selected_labels, selected_data
...
>>> import timeit
>>> setup = "from __main__ import data, labels, juan, stephen_rauch, brad_solomon"
>>> timeit.timeit("juan(data,labels,'fish')", setup, number=1000)
3.1627789690101054
>>> timeit.timeit("stephen_rauch(data,labels,'fish')", setup, number=1000)
3.8860850729979575
>>> timeit.timeit("brad_solomon(data,labels,'fish')", setup, number=1000)
2.7442518350144383
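Before trusting timings like these, it is worth checking that the approaches actually return the same thing. Below is a rough sketch of such a check. The function names (select_loop, select_zip, select_compress) and the sample lists are my own placeholders, not the original answers' code; all three are normalized to return (labels, data) in that order. I also added an empty-result guard to the zip(*...) variant, because unpacking zip(*pairs) into two names raises ValueError when nothing matches.

```python
from itertools import compress

def select_loop(data, labels, target):
    # Plain loop: collect matching labels and values side by side.
    selected_labels, selected_data = [], []
    for label, value in zip(labels, data):
        if label == target:
            selected_labels.append(label)
            selected_data.append(value)
    return selected_labels, selected_data

def select_zip(data, labels, target):
    # Filter the pairs, then transpose them back into two lists.
    pairs = [pair for pair in zip(labels, data) if pair[0] == target]
    if not pairs:
        # zip(*[]) yields nothing, so the two-name unpacking below would fail.
        return [], []
    selected_labels, selected_data = map(list, zip(*pairs))
    return selected_labels, selected_data

def select_compress(data, labels, target):
    # compress keeps the data items whose selector is truthy.
    selected_data = list(compress(data, (label == target for label in labels)))
    selected_labels = [target] * len(selected_data)
    return selected_labels, selected_data

# Hypothetical sample inputs, assumed for illustration only.
labels = ['cat', 'fish', 'dog', 'fish']
data = [0.1, 0.3, 0.5, 0.2]

functions = (select_loop, select_zip, select_compress)
results = [f(data, labels, 'fish') for f in functions]
all_agree = all(r == results[0] for r in results)

# Edge case: a target that never occurs should give two empty lists everywhere.
empty_results = [f(data, labels, 'bird') for f in functions]

print(all_agree)
print(results[0])
```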
I would say that relying on itertools.compress is doing just fine. I was worried that having to build the labels with a list repetition such as ['fish'] * len(selected_data) would slow it down, but it is an expression that can be highly optimized in Python (the size of the list is known ahead of time, and it simply repeats the same pointer). Finally, note that the simple, naive approach I gave can be optimized by "caching" the .append method:
>>> def juan(data, labels, target):
...     selected_labels, selected_data = [], []
...     append_label = selected_labels.append
...     append_data = selected_data.append
...     for l, d in zip(labels, data):
...         if l == target:
...             append_label(l)
...             append_data(d)
...     return selected_labels, selected_data
...
>>> timeit.timeit("juan(data,labels,'fish')", setup, number=1000)
2.577823764993809
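As a small illustration of the "repeating the same pointer" remark about list repetition above: every slot of the repeated list refers to the same object, so nothing is copied per element. This is a sketch with made-up variable names. It also shows why the trick is safe for immutable items like strings but surprising for mutable ones.

```python
n = 5
repeated = ['fish'] * n

# Every element is the very same string object, not n separate copies.
same_object = all(item is repeated[0] for item in repeated)
print(same_object)  # True

# Caveat: with a mutable item, the shared reference becomes visible.
nested = [[]] * 3
nested[0].append(1)
print(nested)  # [[1], [1], [1]]
```

Since the selected labels are all equal strings, sharing one reference is harmless here.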