I have been racking my brains for a good while now, and I am having trouble figuring out which Python package can actually do the things I need, in a reasonably optimised manner.
I am working with a large number of smaller datasets, and need to process them into various plots that combine different subsets of these datasets, and additionally do a small amount of processing on individual datasets. The data is, you could say, multi-dimensional, though some of those dimensions are essentially single values.
To explain my data: it's spectroscopic data. I have a number of devices (one dimension), and for each I have measured counts as a function of wavelength, so the data is two columns (wavelength, counts), with each row being one wavelength and one count. I may process this data to add a baseline and the difference between baseline and counts (two more dimensions). There is some metadata which can be ignored. For each device I have taken this spectroscopic data under different conditions: different power, different temperature, etc. (at least one more dimension?).
Each spectrum is in a different file. There are hundreds of devices, and I may take 20 power configurations on each device, plus a couple of temperature configurations at specific powers but not at all powers. The power configurations won't be the same across devices, and some spectra won't be taken with the same centre wavelength on the spectrometer as others on the same device. The point I want to make here is that the only thing guaranteed is the shape of each spectrum file: (1024, 2), or (1024, 4) after processing.
Each filename contains the additional configuration information for that dataset, which I will parse as the data is imported.
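For context, the filename parsing itself isn't the problem; I do something along these lines (the filename format and pattern below are just a made-up stand-in, my real names differ):

```python
import re

# Hypothetical filename format, e.g. "R2C7_P53uW_T4K.csv";
# the real files encode the same info differently.
PATTERN = re.compile(r"(?P<device>R\d+C\d+)_P(?P<power>\d+)uW_T(?P<temp>\d+)K")

def parse_config(filename):
    """Extract (device, power, temp) from a spectrum filename."""
    m = PATTERN.search(filename)
    if m is None:
        raise ValueError(f"unparseable filename: {filename}")
    return m["device"], int(m["power"]), int(m["temp"])

print(parse_config("R2C7_P53uW_T4K.csv"))  # ('R2C7', 53, 4)
```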
How I may process the data: my processing revolves around plotting wavelength against count rate for the various configurations (power, temperature, etc.). The data has peaks which I will detect, and I may wish to plot the count rates at certain wavelengths against the configurations (power, temperature, etc.). Generally I will only want to plot a single device at a time, but due to the sheer number of files I'm dealing with, I would like to be able to ingest the data, do the generic processing, and make the plots in one go. I'm in no position to manually define structures and data each time.
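The peak detection I have in mind is nothing fancy, roughly scipy.signal.find_peaks on each spectrum (shown here on a synthetic spectrum; the prominence threshold is a guess I'd tune on real data):

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic spectrum: two Gaussian peaks on a flat baseline,
# standing in for one (1024, 2) file.
wavelength = np.linspace(900, 980, 1024)
counts = (100 * np.exp(-((wavelength - 920) / 0.5) ** 2)
          + 250 * np.exp(-((wavelength - 955) / 0.5) ** 2) + 5)

# prominence is a placeholder threshold; tune for real data.
idx, props = find_peaks(counts, prominence=50)
print(wavelength[idx])  # peak positions in nm
```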
I've looked at trying numpy, pandas, and xarray, and I'm wondering whether any of these can handle the data efficiently (at least in its current format).
I could nest a numpy array containing the spectroscopic data inside another numpy array which holds the device, power, and temperature info. Besides being somewhat cumbersome, this is basically what I have been doing so far. The outer array would look something like (device, power, temp, spectra), where spectra is another numpy array (wavelength, counts, baseline, difference).
I looked into using pandas (I am not familiar with it), and my reading suggests that nesting, while possible (Pandas: Nesting Dataframes), is not a good idea (Pandas: Storing a DataFrame object inside another DataFrame i.e. nested DataFrame), and that I should use a MultiIndex instead. However, looking through the pandas tutorials (https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html), I can't quite say I understand how a MultiIndex could be constructed from the data I have, and I'm not sure a single object can fit all of it.
Here is something I was trying. The function reads in a list of spectrum CSV files for a single device at different powers, adds the configuration info as extra columns on each row, joins everything into a single DataFrame with concat, and converts that to a MultiIndex.
import pandas as pd

powers = [53, 62]

def importspectra(spectrafiles):
    frames = []
    for n in range(len(spectrafiles)):
        data = pd.read_csv(spectrafiles[n], names=["wavelength", "counts"], skiprows=37)
        data['device'] = "R2C7"
        data['power'] = powers[n]
        frames.append(data)
    frames = pd.concat(frames)
    return pd.MultiIndex.from_frame(frames)
Now I thought I should be able to sort or select a subset of the data, like data["R2C7"]:
data
Out[4]:
MultiIndex([(900.45203, 2, 'R2C7', 53),
(900.52997, 6, 'R2C7', 53),
(900.60785, 5, 'R2C7', 53),
(900.68579, 5, 'R2C7', 53),
(900.76367, 4, 'R2C7', 53),
(900.84161, 2, 'R2C7', 53),
(900.91949, 5, 'R2C7', 53),
(900.99738, 16, 'R2C7', 53),
(901.07532, 18, 'R2C7', 53),
( 901.1532, 14, 'R2C7', 53),
...
(978.35303, 119, 'R2C7', 62),
(978.42871, 127, 'R2C7', 62),
(978.50433, 125, 'R2C7', 62),
(978.57996, 102, 'R2C7', 62),
(978.65552, 102, 'R2C7', 62),
(978.73114, 121, 'R2C7', 62),
(978.80676, 124, 'R2C7', 62),
(978.88239, 145, 'R2C7', 62),
(978.95801, 188, 'R2C7', 62),
(979.03363, 123, 'R2C7', 62)],
names=['wavelength', 'counts', 'device', 'power'], length=2048)
data["R2C7"]
Traceback (most recent call last):
File "C:\Users\corih\.conda\envs\py39\lib\site-packages\IPython\core\interactiveshell.py", line 3437, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-5-4342b05204a9>", line 1, in <module>
data["R2C7"]
File "C:\Users\corih\.conda\envs\py39\lib\site-packages\pandas\core\indexes\multi.py", line 2028, in __getitem__
if level_codes[key] == -1:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
But clearly that does not work.
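For what it's worth, here is a sketch of the pattern I suspect I was reaching for: index on the metadata columns only (device, power, wavelength) with set_index, and leave counts as a value column, rather than pushing every column into the MultiIndex. Toy in-memory CSVs stand in for the real files here:

```python
import io
import pandas as pd

# Stand-ins for two small spectrum files; in practice these would be
# pd.read_csv(path, names=["wavelength", "counts"], skiprows=37).
csv_53 = io.StringIO("900.45203,2\n900.52997,6\n")
csv_62 = io.StringIO("978.95801,188\n979.03363,123\n")

frames = []
for csv, power in [(csv_53, 53), (csv_62, 62)]:
    data = pd.read_csv(csv, names=["wavelength", "counts"])
    data["device"] = "R2C7"
    data["power"] = power
    frames.append(data)

# Index only the metadata dimensions; counts stays a value column.
df = pd.concat(frames).set_index(["device", "power", "wavelength"]).sort_index()

# Label-based selection then works:
print(df.loc["R2C7"])        # all spectra for one device
print(df.loc[("R2C7", 53)])  # one device at one power
```

Selecting `df.loc["R2C7"]` or `df.loc[("R2C7", 53)]` returns the matching rows with the remaining index levels intact, which seems closer to what I expected `data["R2C7"]` to do.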
Additionally, I've looked at using xarray (again, I'm not familiar with it) and I have similar issues to pandas: I'm not sure how I can construct an xarray object from the data that I have.
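The closest I can imagine for xarray is building one DataArray per file and concatenating along a new configuration dimension; as I understand it, concat aligns on the union of the wavelength coordinates and fills gaps with NaN, which may or may not be acceptable given my spectra don't share wavelength grids. A toy sketch:

```python
import numpy as np
import xarray as xr

# Toy spectra on deliberately different wavelength grids, standing in
# for two files from one device at two powers.
wl_a = np.array([900.0, 900.1, 900.2])
wl_b = np.array([900.1, 900.2, 900.3])

da_a = xr.DataArray([2, 6, 5], dims="wavelength",
                    coords={"wavelength": wl_a}).expand_dims(power=[53])
da_b = xr.DataArray([188, 123, 140], dims="wavelength",
                    coords={"wavelength": wl_b}).expand_dims(power=[62])

# concat aligns on the union of wavelengths; missing points become NaN.
counts = xr.concat([da_a, da_b], dim="power")
print(counts.sel(power=53))
```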
Is there a better way to do this? Or am I wasting my time with higher-dimensional arrays?
Any suggestions would be much appreciated.