I'm relatively new to data analysis with Python, and I'm trying to work out the most practical way to read in my data so that I can index into it and use it in calculations. I have many images in the form of np.arrays, each with a corresponding set of data such as x- and y-coordinates, size, filter number, etc. I just want to make sure each set of data stays grouped with its corresponding image. My first thought was to put everything in an np.array of dataclass instances (where each element of the array is an instance holding all the data for one image). My second thought was a pandas dataframe.
My gut tells me that a dataframe makes more sense. Do np.arrays store nicely inside dataframes? What are the pros/cons of each method, and which would be best if I need to pull data from it often and always need to be able to match each piece of data with its corresponding image?
The variables I have to read in for each image: x_coord (float), y_coord (float), filter (int), and image (np.ndarray).
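To be concrete, the dataclass version I had in mind looks roughly like this (the class name and values are just placeholders):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ImageRecord:  # placeholder name
    x_coord: float
    y_coord: float
    filter: int
    image: np.ndarray


# one instance per image, collected in an object-dtype array
records = np.array(
    [ImageRecord(1.0, 2.0, 0, np.zeros((64, 64))) for _ in range(10)],
    dtype=object,
)
print(records[3].filter, records[3].image.shape)
```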
I've been trying to stick the image arrays into a pandas dataframe, but when I index into it using .loc, the Jupyter Notebook cell is extremely slow to run. It was also very slow to populate the dataframe using .from_dict(). I'm guessing dataframes weren't meant to hold np.ndarrays?
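Here's a stripped-down version of what I'm doing; the real dataset has far more rows and larger images, but the shape of the problem is the same:

```python
import numpy as np
import pandas as pd

n = 1_000
data = {
    "x_coord": np.random.rand(n),
    "y_coord": np.random.rand(n),
    "filter": np.random.randint(0, 5, size=n),
    # each cell holds a whole 2D array, so this column ends up object dtype
    "image": [np.random.rand(128, 128) for _ in range(n)],
}
df = pd.DataFrame.from_dict(data)

# this is the kind of access that is very slow in my notebook
img = df.loc[42, "image"]
```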
My biggest concerns are bookkeeping and ease of indexing: what can I do to make sure I can always retrieve the metadata for the corresponding image? What form should my data be in so that I can easily extract an image and its metadata, or all images with the same filter number, etc.?
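For example (continuing from the df above), these are the kinds of lookups I want to stay easy, and ideally fast, whatever structure I end up with:

```python
# one image together with its metadata
row = df.loc[42]
img, x, y = row["image"], row["x_coord"], row["y_coord"]

# every image taken with filter number 3, metadata still attached
filter3 = df[df["filter"] == 3]
filter3_images = filter3["image"].tolist()
```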