
I have been racking my brains for a good while now, and I am having trouble figuring out which Python package can actually do the things I need in a relatively optimised manner.

I am trying to work with a large number of smaller datasets. I need to process these into various plots that combine different subsets of the datasets, and additionally do a small amount of processing on individual datasets.

The data is, you could say, multi-dimensional, though some of these dimensions are essentially single values.

To explain my data: it's spectroscopic data. I have a number of devices (one dimension), and for each I have measured counts as a function of wavelength, so the raw data is two columns (wavelength, counts), with each row being one wavelength and its count. I may process this data to add a baseline and the difference between the baseline and the counts (two more dimensions). There is some metadata which can be ignored. For each device I have taken this spectroscopic data under different conditions: different power, different temperature, etc. (at least one more dimension?).

Each spectrum is in a separate file. There are hundreds of devices, and I may do 20 power configurations on each device. I may do a couple of temperature configurations at specific powers, but not at all powers. The power configurations won't be the same for different devices, and some spectra won't be taken with the same centre wavelength on the spectrometer as others on the same device. The point I want to make here is that the only thing guaranteed is the shape of each spectrum file: (1024, 2), or (1024, 4) after processing.

Each filename contains the additional information relating to the configuration of the dataset, which I parse as the data is imported.
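
To give a concrete (made-up) example of that parsing step, assuming a hypothetical filename convention like R2C7_power53_temp4K.csv (the real naming convention will differ):

import re

# Hypothetical filename pattern, for illustration only
FILENAME_RE = re.compile(r"(?P<device>R\d+C\d+)_power(?P<power>\d+)_temp(?P<temp>\d+)K")

def parse_filename(filename):
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename}")
    return m["device"], int(m["power"]), int(m["temp"])

parse_filename("R2C7_power53_temp4K.csv")  # -> ('R2C7', 53, 4)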

How I may process the data: my processing revolves around plotting wavelength against count rate for the various configurations (power, temperature, etc.). The data has peaks, which I will detect. I may also wish to plot the count rates at certain wavelengths against the configurations (power, temperature, etc.). Generally I will only want to plot a single device at a time, but due to the sheer number of files I'm dealing with, I would like to be able to ingest the data, do the generic processing, and make the plots in one go; I'm in no position to manually define structures and data each time.
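
To be concrete about the per-spectrum step, the peak detection and plotting I have in mind is along these lines, using scipy.signal.find_peaks on numpy arrays (the prominence threshold is just a placeholder):

import matplotlib.pyplot as plt
from scipy.signal import find_peaks

def plot_spectrum(wavelength, counts, ax=None):
    # wavelength and counts are the two columns of one spectrum file
    if ax is None:
        ax = plt.gca()
    peaks, _ = find_peaks(counts, prominence=50)  # placeholder threshold
    ax.plot(wavelength, counts)
    ax.plot(wavelength[peaks], counts[peaks], "x")
    ax.set_xlabel("wavelength (nm)")
    ax.set_ylabel("counts")
    return peaks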

I've looked at trying to use numpy, pandas, and xarray, and I'm wondering whether or not any of these can efficiently handle the data (at least in the current format).

I could nest a numpy array containing the spectroscopic data inside another numpy array which contains the device, power, and temperature info. Besides being maybe slightly cumbersome, this is basically what I have been doing so far. So the array would look something like (device, power, temp, spectra), where spectra is itself a numpy array of (wavelength, counts, baseline, difference).
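
A minimal sketch of that nesting, with made-up values:

import numpy as np

# One synthetic (1024, 2) spectrum: wavelength and counts columns
spectrum = np.column_stack([np.linspace(900.45, 979.03, 1024),
                            np.random.default_rng(0).poisson(10, 1024)])

# One record per file: (device, power, temp, spectrum); object dtype
# because the fields have mixed types and shapes
table = np.empty((1, 4), dtype=object)
table[0, 0], table[0, 1], table[0, 2], table[0, 3] = "R2C7", 53, 4.0, spectrum

# Selecting every row for one device
r2c7 = table[table[:, 0] == "R2C7"]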

I looked into using pandas (I am not familiar with pandas), and from my reading it seems that nesting, while possible (Pandas: Nesting Dataframes), is not a good idea (Pandas: Storing a DataFrame object inside another DataFrame i.e. nested DataFrame), and that I should use a MultiIndex instead. Though, looking through the pandas tutorials (https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html), I can't quite say that I understand how a multi-index could be constructed from the data I have. I'm not sure a single object can fit all the data.

Here is something I was trying. The function reads in a list of the spectra CSV files for a single device at different powers, adds the additional configuration info to each row, joins the results into a single DataFrame with the concat function, and converts it to a MultiIndex.

import pandas as pd

powers = [53, 62]

def importspectra(spectrafiles):
    frames = []
    for spectrafile, power in zip(spectrafiles, powers):
        # Skip the 37-line metadata header, then read (wavelength, counts)
        data = pd.read_csv(spectrafile, names=["wavelength", "counts"], skiprows=37)
        # Tag every row with the configuration for this file
        data['device'] = "R2C7"
        data['power'] = power
        frames.append(data)
    frames = pd.concat(frames)
    return pd.MultiIndex.from_frame(frames)

Now I thought I should be able to sort or select a subset of the data, like data["R2C7"].

data
Out[4]: 
MultiIndex([(900.45203,   2, 'R2C7', 53),
            (900.52997,   6, 'R2C7', 53),
            (900.60785,   5, 'R2C7', 53),
            (900.68579,   5, 'R2C7', 53),
            (900.76367,   4, 'R2C7', 53),
            (900.84161,   2, 'R2C7', 53),
            (900.91949,   5, 'R2C7', 53),
            (900.99738,  16, 'R2C7', 53),
            (901.07532,  18, 'R2C7', 53),
            ( 901.1532,  14, 'R2C7', 53),
            ...
            (978.35303, 119, 'R2C7', 62),
            (978.42871, 127, 'R2C7', 62),
            (978.50433, 125, 'R2C7', 62),
            (978.57996, 102, 'R2C7', 62),
            (978.65552, 102, 'R2C7', 62),
            (978.73114, 121, 'R2C7', 62),
            (978.80676, 124, 'R2C7', 62),
            (978.88239, 145, 'R2C7', 62),
            (978.95801, 188, 'R2C7', 62),
            (979.03363, 123, 'R2C7', 62)],
           names=['wavelength', 'counts', 'device', 'power'], length=2048)
data["R2C7"]
Traceback (most recent call last):
  File "C:\Users\corih\.conda\envs\py39\lib\site-packages\IPython\core\interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-4342b05204a9>", line 1, in <module>
    data["R2C7"]
  File "C:\Users\corih\.conda\envs\py39\lib\site-packages\pandas\core\indexes\multi.py", line 2028, in __getitem__
    if level_codes[key] == -1:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

But clearly that does not work.

Additionally, I've looked at using xarray (again, I'm not familiar with it), and I have similar issues as with pandas: I'm not sure how I can construct an xarray object from the data that I have.
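
For reference, the pattern I can see working in xarray would be something like the following, but it assumes every spectrum shares one wavelength grid, which is exactly what my data doesn't guarantee (values made up):

import numpy as np
import xarray as xr

wavelength = np.linspace(900.45, 979.03, 1024)            # shared grid
counts = np.random.default_rng(0).poisson(10, (2, 1024))  # two powers

da = xr.DataArray(counts,
                  dims=("power", "wavelength"),
                  coords={"power": [53, 62], "wavelength": wavelength},
                  name="counts")
da.sel(power=53)  # one spectrum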

Is there a better way to do this? Or am I wasting my time with higher-dimensional arrays?

Any suggestions would be much appreciated.

  • There are basically two data formats: the multi-dimensional numeric-dtype `ndarray`, and object-referencing lists. `object` dtype arrays and `pandas` series/frames are list-like (in terms of memory use and speed). There is also the structured array, most commonly created by `genfromtxt` from a `csv`. – hpaulj Mar 29 '21 at 21:49
  • `pandas` stores its data (and indices) in numpy arrays. How many arrays it uses will vary with the column dtypes and probably the indexing. A simple dataframe with a uniform dtype may store all values in one 2d array. When I've looked at frames with different dtypes, it appears to group the columns by dtype. I don't know how it handles multi-indexing. – hpaulj Mar 29 '21 at 23:15

1 Answer


Not sure if I understand the problem correctly, but data["R2C7"] will not work because you named the column "device". If you want to get all the rows where the device "R2C7" was used, you need to write the following:

print(data.loc[data['device'] == 'R2C7'])

Here is a link to another post where this is explained in detail.

I think pandas will do the trick for you. Just keep trying. It gets easier.
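
If you do want a MultiIndex, the trick is to keep the data as a DataFrame and move the label columns into the index with set_index, rather than converting the whole frame with MultiIndex.from_frame (which throws the actual data away). A sketch based on the code in your question:

import pandas as pd

# frames is the list of per-file DataFrames built in your loop
df = pd.concat(frames)
df = df.set_index(["device", "power"]).sort_index()

print(df.loc["R2C7"])        # every spectrum for one device
print(df.loc[("R2C7", 62)])  # one device at one power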

– flo
  • So when I try this, I get ```AttributeError: 'MultiIndex' object has no attribute 'loc'```, which is strange, because https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html does suggest that .loc should work the way you suggested... – Stealthbird97 Mar 29 '21 at 22:49
  • maybe check the pandas version you are using? – flo Mar 30 '21 at 08:38
  • I have ```pandas 1.2.3 py39hf11a4ad_0``` – Stealthbird97 Mar 30 '21 at 21:20