2

I have two lists of equal length, one containing labels and the other data. For example:

labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe', ...]
data = [ 0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8, ... ]

How can I extract sub-lists of both lists in parallel based on a particular label in the labels list?

For example, using 'fish' as the selection criterion, I want to generate:

selected_labels = [ 'fish', 'fish' ]
selected_data = [ 0.3, 0.2 ]

My best guess sounds cumbersome - make a list of element-wise tuples, extract a list of relevant tuples from that list, then de-tuple that list of tuples back into two lists of single elements. Even if that's the way to approach it, I'm too new to Python to stumble on the syntax for that.

Stephen Rauch
omatai

5 Answers

4

Using zip() and a generator expression this can be done like:

Code:

tuples = (x for x in zip(labels, data) if x[0] == 'fish')
selected_labels, selected_data = map(list, zip(*tuples))

How does this work?

The tuples line builds a generator expression which zips the two lists together and drops anything that is uninteresting. The second line uses zip again, this time with * unpacking to regroup the pairs by position, and then maps the resulting tuples into lists as desired.

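Because the first line is a generator expression, no intermediate list of pairs is built, which keeps this fairly fast and memory efficient. Note that zip(*tuples) still consumes the whole generator when unpacking it, so the saving is modest.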
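For readers who find the unpacking step hard to follow, here is a minimal sketch that spells out each stage with ordinary lists so every intermediate value can be printed. The names pairs, fish_pairs and columns are illustrative only and are not part of the answer's code.

```python
labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

# Step 1: pair each label with its datum.
pairs = list(zip(labels, data))

# Step 2: keep only the pairs whose label matches.
fish_pairs = [pair for pair in pairs if pair[0] == 'fish']
# fish_pairs is [('fish', 0.3), ('fish', 0.2)]

# Step 3: zip(*fish_pairs) is the same call as
# zip(('fish', 0.3), ('fish', 0.2)), which regroups the pairs by position.
columns = list(zip(*fish_pairs))
# columns is [('fish', 'fish'), (0.3, 0.2)]

# Step 4: turn each tuple into a list and unpack the two results.
selected_labels, selected_data = map(list, columns)
```

Using lists here trades a little memory for inspectability, which is usually the right call while debugging.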

Test Code:

labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

tuples = (x for x in zip(labels, data) if x[0] == 'fish')
selected_labels, selected_data = map(list, zip(*tuples))

print(selected_labels)
print(selected_data)

Results:

['fish', 'fish']
[0.3, 0.2]
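One caveat worth knowing: if the label never occurs, zip(*tuples) yields nothing and the two-name unpacking raises ValueError. The sketch below shows that failure and one possible way around it, using a hypothetical helper called select_by_label built from plain list comprehensions. It is an illustration, not part of the code above.

```python
labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

# The zip(*...) form fails when nothing matches, because there are
# no tuples to unpack into the two target names.
try:
    tuples = (x for x in zip(labels, data) if x[0] == 'zebra')
    selected_labels, selected_data = map(list, zip(*tuples))
except ValueError as exc:
    print("no match:", exc)


def select_by_label(labels, data, target):
    """Hypothetical helper: return (labels, data) restricted to target."""
    pairs = [(l, d) for l, d in zip(labels, data) if l == target]
    # Two comprehensions always produce two lists, even when pairs is empty.
    selected_labels = [l for l, _ in pairs]
    selected_data = [d for _, d in pairs]
    return selected_labels, selected_data


print(select_by_label(labels, data, 'fish'))
print(select_by_label(labels, data, 'zebra'))
```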
Stephen Rauch
  • Can I suggest something: instead of `tuples = ( ... )` which creates a mysterious object that is hard to inspect/print, that you change to `list_of_tuples = [ ... ]` which is easier to inspect/print, and ultimately easier to understand? If tuples are better in some way, perhaps explain why. – omatai Feb 12 '18 at 03:51
  • @omatai It's a generator, so you don't evaluate all at once. – Lingxi Feb 12 '18 at 03:53
  • As explained tuples is better because it is a generator expression. The suggested `[]` will build an intermediate list. The generator expression will build no extra data structures. But thanks for bringing this up. I have added a link to generator expression pep. If you need to debug, then definitely change to `[]` while debugging. – Stephen Rauch Feb 12 '18 at 03:54
  • In addition to `map()`, perhaps also show the use of list comprehension? – Lingxi Feb 12 '18 at 03:56
  • OK - so you're saying that just because it is enclosed in `()` it doesn't generate tuples immediately, it creates a generator object that will generate tuples when required? I think I get that. Still having difficulty comprehending the list unpacking, but working on it... – omatai Feb 12 '18 at 03:56
  • @omatai it is a generator expression, essentially, the list-comprehension equivalent of a generator. – juanpa.arrivillaga Feb 12 '18 at 03:57
  • In this particular case, generator doesn't actually pay back much. It's fully evaluated at `*tuples`. – Lingxi Feb 12 '18 at 03:59
  • One last thing: what if the list of data involves a more complex type, such as a numpy array? Should this code still work? In my case I do have more complex data, and cannot get this to work :-( – omatai Feb 12 '18 at 04:28
  • Iterating on Numpy/pandas data is often different. Numpy has very specialized (vectorized) ways to get performance. Suggest you put together another specific question. This question here was well received, I am sure you can construct another great question. – Stephen Rauch Feb 12 '18 at 04:31
3

This might be a good place to apply itertools.compress, which is slightly faster than the zip approach, at least for the size of data structures you're working with.

from itertools import compress

selected_data = list(compress(data, (i=='fish' for i in labels)))
selected_labels = ['fish'] * len(selected_data)

Usage:

compress('ABCDEF', [1,0,1,0,1,1]) --> A C E F

Timing:

def with_compress():
    selected_data = list(compress(data, (i=='fish' for i in labels)))
    selected_labels = ['fish'] * len(selected_data)
    return selected_data, selected_labels

def with_zip():
    tuples = (x for x in zip(labels, data) if x[0] == 'fish')
    selected_labels, selected_data = map(list, zip(*tuples))
    return selected_data, selected_labels

%timeit -r 7 -n 100000 with_compress()
3.82 µs ± 96.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit -r 7 -n 100000 with_zip()
4.67 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

(i=='fish' for i in labels) is a generator of True and False. compress filters down data element-wise to cases where True occurs.

From the docstring:

Roughly equivalent to:

def compress(data, selectors):
    # compress('ABCDEF', [1,0,1,0,1,1]) --> A C E F
    return (d for d, s in zip(data, selectors) if s)
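If the labels should also be filtered rather than rebuilt by repetition, the same selector can be applied to both lists. This is a small sketch under one assumption: the mask is stored as a list (the name mask is illustrative), because a generator would be exhausted after the first compress call.

```python
from itertools import compress

labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

# Materialize the mask as a list so it can be reused for both lists;
# a generator would be exhausted after the first compress() call.
mask = [label == 'fish' for label in labels]

selected_labels = list(compress(labels, mask))
selected_data = list(compress(data, mask))

print(selected_labels)
print(selected_data)
```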
Brad Solomon
2

You can zip the lists together, filter them based on the keyword you are looking for, and then unzip:

>>> items = list(zip(*filter(lambda x: x[0] == "fish", zip(labels, data))))
>>> items
[('fish', 'fish'), (0.3, 0.2)]

Then your selected_data and selected_labels would be:

>>> selected_data = list(items[1])
>>> selected_labels = list(items[0])

Another alternative is to use the map function to get the desired format:

>>> items = map(list, zip(*filter(lambda x: x[0] == "fish", zip(labels, data))))
>>> list(items)
[['fish', 'fish'], [0.3, 0.2]]
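In Python 3, zip, filter and map all return one-shot iterators rather than lists, which matters if the result is indexed or read more than once. A brief sketch of that behaviour, with first_pass and second_pass as illustrative names:

```python
labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

items = zip(*filter(lambda x: x[0] == "fish", zip(labels, data)))

first_pass = list(items)   # consumes the iterator
second_pass = list(items)  # nothing is left to read

print(first_pass)
print(second_pass)

# Unpacking directly avoids holding on to the exhausted iterator.
selected_labels, selected_data = map(
    list, zip(*filter(lambda x: x[0] == "fish", zip(labels, data)))
)
```

Note that the direct unpacking on the last lines assumes at least one match exists.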
Sohaib Farooqi
2

The simplest approach is totally fine here, and likely very performant:

>>> selected_labels, selected_data  = [], []
>>> for l, d in zip(labels, data):
...     if l == 'fish':
...         selected_labels.append(l)
...         selected_data.append(d)
...
>>> selected_labels
['fish', 'fish']
>>> selected_data
[0.3, 0.2]

Here are some more timings. I didn't have time to include every approach so far, but here are a few:

>>> labels*=5000
>>> data *= 5000
>>> def juan(data, labels, target):
...     selected_labels, selected_data  = [], []
...     for l, d in zip(labels, data):
...         if l == target:
...             selected_labels.append(l)
...             selected_data.append(d)
...     return selected_labels, selected_data
...
>>> def stephen_rauch(data, labels, target):
...     tuples = (x for x in zip(labels, data) if x[0] == target)
...     selected_labels, selected_data = map(list, zip(*tuples))
...     return selected_labels, selected_data
...
>>> from itertools import compress
>>>
>>> def brad_solomon(data, labels, target):
...     selected_data = list(compress(data, (i==target for i in labels)))
...     selected_labels = [target] * len(selected_data)
...     return selected_data, selected_labels
...
>>> import timeit
>>> setup = "from __main__ import data, labels, juan, stephen_rauch, brad_solomon"
>>> timeit.timeit("juan(data,labels,'fish')", setup, number=1000)
3.1627789690101054
>>> timeit.timeit("stephen_rauch(data,labels,'fish')", setup, number=1000)
3.8860850729979575
>>> timeit.timeit("brad_solomon(data,labels,'fish')", setup, number=1000)
2.7442518350144383

I would say that relying on itertools.compress does just fine. I was worried that having to do selected_labels = ['fish'] * len(selected_data) would slow it down, but that expression can be highly optimized in Python (the size of the list is known ahead of time, and it simply repeats the same pointer). Finally, note that the simple, naive approach I gave can be optimized by "caching" the .append method:

>>> def juan(data, labels, target):
...     selected_labels, selected_data  = [], []
...     append_label = selected_labels.append
...     append_data = selected_data.append
...     for l, d in zip(labels, data):
...         if l == target:
...             append_label(l)
...             append_data(d)
...     return selected_labels, selected_data
...
>>> timeit.timeit("juan(data,labels,'fish')", setup, number=1000)
2.577823764993809
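To reproduce a comparison like this outside the interactive prompt, something along the following lines could work. It is only a sketch: loop_version and compress_version are placeholder names, it passes callables to timeit.repeat instead of a setup string, and the absolute numbers will differ by machine and Python version.

```python
import timeit
from itertools import compress

labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe'] * 5000
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8] * 5000


def loop_version(data, labels, target):
    selected_labels, selected_data = [], []
    for l, d in zip(labels, data):
        if l == target:
            selected_labels.append(l)
            selected_data.append(d)
    return selected_labels, selected_data


def compress_version(data, labels, target):
    selected_data = list(compress(data, (l == target for l in labels)))
    return [target] * len(selected_data), selected_data


# Check that both variants agree before timing them.
assert loop_version(data, labels, 'fish') == compress_version(data, labels, 'fish')

for func in (loop_version, compress_version):
    # Take the best of 3 repeats of 20 calls each to reduce noise.
    best = min(timeit.repeat(lambda: func(data, labels, 'fish'),
                             number=20, repeat=3))
    print(f"{func.__name__}: {best / 20 * 1e3:.3f} ms per call")
```

Taking the minimum over repeats is a common way to damp interference from other processes.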
juanpa.arrivillaga
  • Thanks for this - definitely the easiest answer to understand. That said, it is very useful to see a range of ways to solve the same problem, some of which introduce more powerful concepts. – omatai Feb 12 '18 at 04:01
0

As an alternative to the zip answer, you might consider using a different data structure. I would put the data in a dict keyed by label:

data = {'cat' : [0.3, 0.1],
        'dog' : [0.9, 0.5, 0.4],
        'fish' : [0.3, 0.2],
        'giraffe' : [0.8],
        # ...
        }

Then to access, just data['fish'] will give [0.3, 0.2]

You can load the data you have into such a dict by doing this one time only:

data2 = {}
for label, datum in zip(labels,data):
    if label not in data2:
        data2[label] = []
    data2[label].append(datum)

Then just do this for each query

select = 'fish'
selected_data = data2[select]
selected_labels = [select] * len(selected_data)
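The one-time grouping step can also be written with collections.defaultdict, which removes the explicit membership check. This is a sketch of the same idea rather than a different method; the name grouped is illustrative.

```python
from collections import defaultdict

labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

# Build the label -> values mapping once; missing keys start as [].
grouped = defaultdict(list)
for label, datum in zip(labels, data):
    grouped[label].append(datum)

select = 'fish'
# .get avoids inserting an empty entry for labels that never occurred.
selected_data = grouped.get(select, [])
selected_labels = [select] * len(selected_data)

print(selected_labels)
print(selected_data)
```

The grouping costs one pass over the data, after which each query is a single dictionary lookup.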
Alan Hoover
  • I don't have the luxury of supplying the data in a new data structure. If you want to avoid downvote, please specify the code to make the conversion. Also, weird as it seems, even if there are 10000 items of `fish` data, I still need a second list with 10000 identical labels of `fish`. – omatai Feb 12 '18 at 03:30