Conceptually, I have two lists of equal length, one containing labels and the other data. And so I asked this question, not realising that what I really had was two numpy arrays, not two lists.
What I do have is a folder containing images such as cat_01.jpg, cat_02.jpg, dog_01.jpg, dog_02.jpg, dog_03.jpg, fish_01.jpg, ..., tiger_03.jpg, zebra_01.jpg and zebra_02.jpg. I also have a working program that reads them in, parses a portion of each file name into a labels array, and loads the corresponding image data into my data array, so that I end up with something like:
>>> labels
array(['cat', 'cat', 'dog', ..., 'tiger', 'zebra', 'zebra' ])
>>> type( data )
<class 'numpy.ndarray'>
>>> data[0][0][0]
array([78, 88, 98])
That makes sense: for each sample, data[sample][row][column] is the (R,G,B) data point at (column, row).
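A minimal sketch of that indexing convention, using a small synthetic array in place of real images (the shape and values are assumptions chosen only for illustration):

```python
import numpy as np

# Hypothetical stand-in for the image array:
# 4 samples, 2 rows, 3 columns, 3 colour channels.
data = np.arange(4 * 2 * 3 * 3).reshape(4, 2, 3, 3)

sample, row, column = 1, 0, 2
pixel = data[sample][row][column]        # chained indexing, as written above
same_pixel = data[sample, row, column]   # equivalent single-step tuple indexing

print(data.shape)   # (4, 2, 3, 3)
print(pixel)        # [24 25 26]
```

Both forms return the same three-element array; the tuple form simply avoids creating intermediate views.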
I want to specify a search label such as 'dog', and (conceptually) use it to generate two "sub-lists" - the first containing all the (identical) matching labels in the labels list, and the other containing the associated image data from data. But rather than lists, I need to retain the original data format, in this case numpy arrays (but if there is a more general, data-insensitive approach, I'd love to know about it). How can I do this?
Update: here's some specific test code to recreate the situation I am confronting, and with a sketch of a solution based on Stephen Rauch's answer:
import os, glob
from PIL import Image
import numpy as np
import pandas as pd # not critical to question
def load_image(file):
    data = np.asarray(Image.open(file),dtype="float")
    return data
MasterClass = ['cat','dog','fsh','grf','hrs','leo','owl','pig','tgr','zbr']
os.chdir('data\\animals')
filelist = glob.glob("*.jpg")
full_labels = np.array([MasterClass.index(os.path.basename(fname)[:3]) for fname in filelist])
full_images = np.array([load_image(fname) for fname in filelist])
# The following sketches a solution, but it leads to incompatible data types.
# That is, test_images and/or test_labels differ from full_images and
# full_labels in the data types involved.
df = pd.DataFrame(dict(label=list(full_labels),data=list(full_images)))
criteria = df['label'] == MasterClass.index('dog')
test_labels = np.array(df[criteria]['label'])
test_images = np.array(df[criteria]['data'])
Two notes:
- When I originally wrote that there were file names "such as" tiger_03.jpg, I was de-obfuscating reality. In truth the code above expects file names like tgr03.jpg, and the list of labels I end up working with is not even ['cat', 'cat', 'dog', ...] but is instead a list of indices into the MasterClass list - that is, [0, 0, 1, ...]
- For test purposes the contents of the files don't actually matter, so long as they are valid (JPEG) images. You can easily test with a handful of (identical) files in a folder with a handful of different names.
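The data-type mismatch mentioned in the code comments can be reproduced without any image files. The sketch below swaps the loaded JPEGs for synthetic arrays (an assumption: five same-sized 4x4 RGB "images" with integer labels) and runs the same DataFrame round trip. The np.stack call at the end is one possible way to rebuild a regular array, not necessarily what the referenced answer intended.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins (assumptions): 5 "images" of 4x4 RGB, integer class labels.
full_labels = np.array([0, 0, 1, 1, 2])
full_images = np.zeros((5, 4, 4, 3), dtype="float")

# Same round trip as in the code above.
df = pd.DataFrame(dict(label=list(full_labels), data=list(full_images)))
criteria = df['label'] == 1
test_images = np.array(df[criteria]['data'])

# The original is a regular 4-D float array ...
print(full_images.dtype, full_images.shape)
# ... but the selection is a 1-D array whose elements are separate arrays.
print(test_images.dtype, test_images.shape)

# Stacking the selected per-image arrays rebuilds a regular 4-D array.
rebuilt = np.stack(df[criteria]['data'].to_list())
print(rebuilt.dtype, rebuilt.shape)
```

Each DataFrame cell holds a whole image as a Python object, so converting the column back with np.array yields an object array of shape (n,) rather than (n, rows, columns, 3).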
The question is: how do I get test_labels and test_images to be in an identical format to the original full_labels and full_images, but based on a selection criteria like the one sketched above? This code as it stands does not achieve this level of data compatibility - it does not achieve a strict "slice" function.
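One candidate for such a strict "slice" is NumPy boolean-mask indexing applied directly to both arrays, skipping the DataFrame. This is a sketch under stated assumptions: the arrays below are synthetic stand-ins, MasterClass is shortened, and all images are assumed to share one size so that full_images is a regular 4-D array.

```python
import numpy as np

MasterClass = ['cat', 'dog', 'fsh']          # shortened copy of the list above
full_labels = np.array([0, 0, 1, 1, 1, 2])   # synthetic stand-in labels
full_images = np.random.rand(6, 4, 4, 3)     # synthetic stand-in images

# Boolean mask: True wherever the label matches the requested class index.
mask = full_labels == MasterClass.index('dog')

# Indexing with the mask selects along the first (sample) axis only.
test_labels = full_labels[mask]
test_images = full_images[mask]

print(test_labels)         # [1 1 1]
print(test_images.shape)   # (3, 4, 4, 3)
print(test_images.dtype == full_images.dtype)   # True
```

The selected arrays keep the dtype and trailing dimensions of the originals; only the number of samples changes. The same mask can index any further per-sample array of matching length.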