
I'm relatively new to data analysis with Python and I'm trying to determine the most practical way to read in my data so that I can index into it and use it in calculations. I have many images in the form of np.arrays, each with a corresponding set of metadata such as x- and y-coordinates, size, filter number, etc. I just want to make sure each set of metadata stays grouped with its corresponding image. My first thought was an np.array of dataclass instances (where each element of the array is an instance containing all the data for one image). My second thought was a pandas DataFrame.

My gut tells me that a DataFrame makes more sense. Do np.arrays store nicely inside DataFrames? What are the pros/cons of each method, and which would be best given that I will need to pull data from them often and always need to be able to match the data with its corresponding image?

The variables I have to read in: x_coord (float), y_coord (float), filter (int), image (np.ndarray).
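For concreteness, here's a minimal sketch of the dataclass idea (the class name is just a placeholder I made up):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    x_coord: float
    y_coord: float
    filter: int        # filter number
    image: np.ndarray  # e.g. a 70x70 pixel array

# each element of the array would then be one Observation
observations = np.array(
    [Observation(1.0, 2.0, 1, np.zeros((70, 70)))],
    dtype=object,
)
```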

I've been trying to stick the image arrays into a pandas DataFrame, but indexing into it with .loc is extremely slow to run in a Jupyter Notebook cell. Populating the DataFrame with .from_dict() was also very slow. I'm guessing DataFrames weren't meant to hold np.ndarrays?
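This is roughly the kind of thing I've been doing (simplified; the real values come from parsing my text files):

```python
import numpy as np
import pandas as pd

# Each image ends up in an object-dtype column alongside its metadata.
data = {
    "x_coord": [1.0, 2.5],
    "y_coord": [3.0, 4.5],
    "filter": [1, 2],
    "image": [np.zeros((70, 70)), np.ones((70, 70))],
}
df = pd.DataFrame.from_dict(data)

img = df.loc[0, "image"]  # this style of lookup is what runs slowly for me
```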

My biggest concerns are bookkeeping and ease of indexing: what can I do to make sure I can always retrieve the metadata for the corresponding image? In what form should my data be so that I can easily extract an image and its metadata, or all images with the same filter number, etc.?
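For example, I'd want to support lookups like the following (a sketch of the usage I'm after; this layout is just one guess at how to arrange things):

```python
import numpy as np
import pandas as pd

# One possible layout: metadata in a DataFrame, images stacked in a
# parallel 3-D array, matched by position (row i <-> images[i]).
meta = pd.DataFrame({
    "x_coord": [1.0, 2.5, 3.1],
    "y_coord": [3.0, 4.5, 0.2],
    "filter": [1, 2, 1],
})
images = np.zeros((3, 70, 70))  # ~1000 70x70 images in my real data

# One image plus its metadata:
img, row = images[0], meta.iloc[0]

# All images taken with filter 1, plus their metadata:
mask = (meta["filter"] == 1).to_numpy()
filter1_meta = meta[mask]
filter1_images = images[mask]
```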

  • Can you provide a sample of your data? You're saying `I'm trying to determine the most practical and useful way to store my data`. It seems like you need a database. – IMCoins Jul 15 '19 at 09:25
  • Sorry, I want to be able to index into the data to pull out the different variables to use in calculations. I don't need to manipulate the data once it is stored somewhere, but I need to use it within Python to perform data analysis. Maybe the correct question should be "What is the best way to read in my data?" – lydia-441 Jul 15 '19 at 14:38
  • There is no "best way" per se. But there is a "best way" once you have criteria, like speed, readability, etc. You have a million choices for any given task when programming. That being said, I think you should use a JSON file to represent your data, or a proper relational database; but as you seem new, a JSON file will be enough to begin with, I think. – IMCoins Jul 15 '19 at 14:43
  • In what initial format (i.e., before becoming np arrays) does the data exist NOW? Or if it doesn't exist yet, in what format will it BEGIN? RAW? JPEG? BMP? Stored locally or remotely? – Rick Jul 15 '19 at 15:03
  • Right now I have .png files for the images stored locally. For the metadata I have text files containing sentences that I parse to extract the metadata, which I store in np.arrays like this: `np.array([x_coord, y_coord, filter])`. I read the images in as a `skimage` ImageCollection, so they are represented as `np.arrays`. – lydia-441 Jul 15 '19 at 15:09
  • Tip: if you reply to a specific person using the format @username (with spaces removed), they will get a notification that you responded to them. It sounds to me like you might benefit from storing the metadata separately from the large PNG files. dataclasses is a great time saver/simplifier for the holder of this metadata. I would suggest using pandas to organize the metadata and make it easily manipulated, with a column that refers to the actual image locations on disk (a sketch of this setup appears after these comments). When you actually need to read the files to create new pandas columns, use generators to open them one at a time. – Rick Jul 15 '19 at 15:17
  • If you need to read/write to the large files a LOT, you may want to look into shared memory or asynchronous strategies, and spin out the tasks working on each file separately so your notebook doesn't get bogged down. This might help you store the metadata in a shared memory space that can be accessed by workers doing things to the files: https://docs.python.org/3.8/library/multiprocessing.shared_memory.html – Rick Jul 15 '19 at 15:19
  • @RickTeachey Thanks for the tip. I think I will follow that route, although I was really hoping for a way to store the images in their array format along with the metadata because I will need to use all the images and metadata at once (about 1000 70x70 pixel images), not one at a time. But there just doesn't seem to be a proper way to do that. I will also take a look into the shared memory strategies. – lydia-441 Jul 15 '19 at 15:27
  • Combining generators and shared memory is probably going to be faster where you need to do a single set of tasks to the files one by one, and will use less memory. You could write a context manager (using the `with` keyword) to provide transparent access to the file. Then perhaps another method that accepts a pandas DF, an index (specific to THAT dataclass instance), a column name, and a task function as args, and performs each task function, writing the result to the DataFrame at that index and column. For a set of tasks where you would end up re-loading the files over and over, this might not be optimal. – Rick Jul 15 '19 at 15:34
  • Another strategy, if you have plenty of memory to load the files and the notebook just gets bogged down as you described by having the np arrays in the nb process, might be to store them separately from the metadata in the nb as I suggested, but in memory rather than on disk. – Rick Jul 15 '19 at 15:38
  • @RickTeachey This is a tad above my current knowledge so I will do some more research on the items you suggested to see how to implement them and whether or not that is what I'm looking for. Thank you! I will most likely store the images separately like you suggested. – lydia-441 Jul 15 '19 at 15:40
  • This question and answers might help: https://stackoverflow.com/questions/45545110/how-do-you-parallelize-apply-on-pandas-dataframes-making-use-of-all-cores-on-o and here is an article about swifter: https://medium.com/@jmcarpenter2/swiftapply-automatically-efficient-pandas-apply-operations-50e1058909f9 – Rick Jul 15 '19 at 15:44
  • Also, final meta tip: if you find yourself crafting a solution that is overly complicated, that's a good sign you are probably taking a misstep. There is rarely an occasion to make things very complicated in the python world. – Rick Jul 15 '19 at 16:00
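For context, here is a minimal sketch of the setup Rick describes above: metadata in pandas with a column of file paths, plus a generator that loads images one at a time. All file names and column names are hypothetical:

```python
import pandas as pd
from skimage import io

# Metadata lives in the DataFrame; the images stay on disk as .png files.
meta = pd.DataFrame({
    "x_coord": [1.0, 2.5],
    "y_coord": [3.0, 4.5],
    "filter": [1, 2],
    "path": ["images/img_000.png", "images/img_001.png"],
})

def iter_images(df):
    """Yield (index, image) pairs one at a time to keep memory use low."""
    for idx, path in df["path"].items():
        yield idx, io.imread(path)

# e.g. process only the filter-1 images, with their metadata still in reach:
for idx, img in iter_images(meta[meta["filter"] == 1]):
    print(idx, img.shape, meta.loc[idx, "x_coord"])
```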
