
Conceptually, I have two lists of equal length, one containing labels and the other data. And so I asked this question, not realising that what I really had was two numpy arrays, not two lists.

What I do have is a folder containing images such as cat_01.jpg, cat_02.jpg, dog_01.jpg, dog_02.jpg, dog_03.jpg, fish_01.jpg, ..., tiger_03.jpg, zebra_01.jpg and zebra_02.jpg. I also have a successful program to read them in, parse a portion of each file name into a labels array, and the corresponding image data into my data array, so that I end up with something like:

>>> labels
array(['cat', 'cat', 'dog',  ..., 'tiger', 'zebra', 'zebra' ])
>>> type( data )
<class 'numpy.ndarray'>
>>> data[0][0][0]
array([78, 88, 98])

That makes sense: for each sample, data[ sample ][ row ][ column ] is the (R,G,B) value of the pixel at (column, row).

I want to specify a search label such as 'dog', and (conceptually) use it to generate two "sub-lists": the first containing all the (identical) matching labels from the labels list, and the second containing the associated image data from data. But rather than lists, I need to retain the original data format, in this case numpy arrays (though if there is a more general, data-insensitive approach, I'd love to know about it). How can I do this?

Update: here's some specific test code to recreate the situation I am confronting, and with a sketch of a solution based on Stephen Rauch's answer:

import os, glob
from PIL import Image
import numpy as np
import pandas as pd    # not critical to question

def load_image(file):
  data = np.asarray(Image.open(file),dtype="float")
  return data

MasterClass = ['cat','dog','fsh','grf','hrs','leo','owl','pig','tgr','zbr']
os.chdir('data\\animals')
filelist = glob.glob("*.jpg")

full_labels = np.array([MasterClass.index(os.path.basename(fname)[:3]) for fname in filelist])
full_images = np.array([load_image(fname) for fname in filelist])
# The following sketches a solution, but it leads to incompatible data types.
# That is, test_images and/or test_labels differ from full_images and
# full_labels with regard to the data types involved.
df = pd.DataFrame(dict(label=list(full_labels),data=list(full_images)))
criteria = df['label'] == MasterClass.index('dog')
test_labels = np.array(df[criteria]['label'])
test_images = np.array(df[criteria]['data'])

Two notes:

  • When originally I wrote that there were file names "such as" tiger_03.jpg, I was de-obfuscating reality. In truth the code above expects file names like tgr03.jpg, and the list of labels I end up working with is not even ['cat', 'cat', 'dog', ...] but is instead a list of indices in the MasterClass list - that is, [0, 0, 1, ...]
  • For test purposes the contents of the files don't actually matter, so long as they are valid (JPEG) images. You can easily test with a handful of (identical) files in a folder with a handful of different names.

The question is: how do I get test_labels and test_images into a format identical to the original full_labels and full_images, but based on a selection criterion like the one sketched above? As it stands, this code does not achieve that level of data compatibility; it does not behave as a strict "slice" function.

omatai

3 Answers

1

If you can use pandas, it is VERY good at this sort of thing.

Code:

If you already have a dataframe, you can simply do:

# build a logical condition
have_dog = df['animal_label'] == 'dog'

# select the data when that condition is true
print(df[have_dog])

Test Code:

import pandas as pd
import numpy as np

animal_label = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]
data = [np.array((x,) * 3) for x in data]

df = pd.DataFrame(dict(animal_label=animal_label, data=data))
print(df)

have_dog = df['animal_label'] == 'dog'
print(df[have_dog])

Results:

  animal_label             data
0          cat  [0.3, 0.3, 0.3]
1          cat  [0.1, 0.1, 0.1]
2          dog  [0.9, 0.9, 0.9]
3          dog  [0.5, 0.5, 0.5]
4          dog  [0.4, 0.4, 0.4]
5         fish  [0.3, 0.3, 0.3]
6         fish  [0.2, 0.2, 0.2]
7      giraffe  [0.8, 0.8, 0.8]

  animal_label             data
2          dog  [0.9, 0.9, 0.9]
3          dog  [0.5, 0.5, 0.5]
4          dog  [0.4, 0.4, 0.4]
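
The selected rows can also be turned back into plain numpy arrays. The following is only a sketch that reuses the test data above; the names dog_labels and dog_data are made up for illustration. It assumes every entry in the data column has the same shape, so that np.stack can join them into one array.

```python
import numpy as np
import pandas as pd

animal_label = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]
data = [np.array((x,) * 3) for x in data]

df = pd.DataFrame(dict(animal_label=animal_label, data=data))
have_dog = df['animal_label'] == 'dog'

# Rebuild a string array from the selected labels
dog_labels = np.array(df.loc[have_dog, 'animal_label'].tolist())

# The 'data' column holds one small array per row; stack them along a new
# first axis to get a single 2-D float array instead of an object array
dog_data = np.stack(list(df.loc[have_dog, 'data']))

print(dog_labels)
print(dog_data.shape, dog_data.dtype)
```

Calling np.array directly on the selected column would instead give a 1-D array of dtype object whose elements are themselves arrays, which is one way the extracted data can end up in a different format from the original.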
Stephen Rauch
  • In your example, one of the "conceptual" lists is an actual list - the `animal_labels`. In my case, both of the objects are `numpy` arrays. I think if you use `numpy_labels = np.asarray(animal_labels)` you get the situation I intend... but then I can't construct the DataFrame - every permutation of `df = pd.DataFrame(dict(labels=list(numpy_labels),data=data))` I have tried fails with a "data must be 1-dimensional" exception. – omatai Feb 12 '18 at 22:09
  • There are myriad ways to construct a data frame. A dict is just a quick way if your data is already in that form. I do not fully understand your data description, but pandas is VERY good at mapping to native numpy structures. The primary point was just to show how to select dataframe rows based on a label match. – Stephen Rauch Feb 12 '18 at 22:13
  • I can now construct the DataFrame using `pd.DataFrame(dict(labels=list(numpy_labels),data=list(numpy_data)))` - you have interpreted my "conceptual" lists as actual lists of numpy arrays. There are only numpy arrays "acting" as lists in a conceptual sense. – omatai Feb 12 '18 at 22:40
  • Appreciate the pointer in the direction of pandas :-) My only remaining issues seem to be ensuring that the two arrays extracted from the DataFrame are identical in format to the two original numpy arrays I had. Working on it... – omatai Feb 12 '18 at 22:40
  • I have updated question to include test code, partly based on this answer - hopefully that will make things clearer :-) – omatai Feb 12 '18 at 23:24
0

If I understand your problem correctly, this would be done by slicing like this:

selector = 'fish'
matching_labels = labels[labels==selector]
matching_data = data[labels==selector]

Alternatively, you could use the approach from the answer to your previous question and convert the list alist to a numpy array with alist = numpy.array(alist).
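
To illustrate the slicing above on two separate arrays, here is a small sketch with made-up stand-in data (six 2x2 RGB "images" rather than real files). It assumes all images share one shape, so that data is a single 4-D array with one sample per entry along the first axis.

```python
import numpy as np

# Hypothetical stand-ins for the real arrays: six tiny 2x2 RGB "images"
labels = np.array(['cat', 'cat', 'dog', 'dog', 'dog', 'fish'])
data = np.arange(6 * 2 * 2 * 3, dtype=float).reshape(6, 2, 2, 3)

selector = 'dog'
mask = labels == selector        # boolean array, one entry per sample

matching_labels = labels[mask]   # 1-D array of the matching labels
matching_data = data[mask]       # keeps whole samples along the first axis

print(matching_labels.shape, matching_labels.dtype)
print(matching_data.shape, matching_data.dtype)
```

The mask is computed from labels alone and only its positions are used to index data, so the two arrays do not need to be joined first. Both results keep the dtype of their source arrays. With integer labels, as in the updated question, the same pattern would presumably apply with mask = full_labels == MasterClass.index('dog').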

rammelmueller
  • The data is not labeled, so your last line would not work. The two conceptual "lists" are completely separate - they need to be zipped, cross-referenced, etc – omatai Feb 12 '18 at 21:04
0

Based on Stephen Rauch's answer to my earlier simpler question, it is possible to solve this as follows:

# assume full_labels and full_images exist as per test code in updated question
tuples = (x for x in zip(list(full_labels),list(full_images)) if x[0] == MasterClass.index('dog'))
xlabels,ximages = map(list, zip(*tuples))
test_labels = np.array(xlabels)
test_images = np.array(ximages)
omatai