Best data structure for sparse data with multiple dimensions

Question

I want to structure my data, similarly to pandas, so that it is easy to explore. I tried xarray.DataArray for this (the pandas documentation recommends xarray for n-dimensional data: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#panel4d-and-panelnd-deprecated), but it appears inefficient because my data is sparse. Is there a better way to structure my data, either with xarray.DataArray or with another Python data structure, that still allows easy data exploration?

Description of data

My data consists of prescriptions given to patients. Each entry consists of:

Date (datetime64)
Patient Id (int)
Drug name (string)
Drug type (string)
Drug class (string)
Scheduled dosage (real value)
Dosage as needed (real value)

There might be several prescriptions on the same date for different patients. A patient might also be prescribed several drugs (e.g., 2-3 drugs) at the same time, each with a 'mandatory' (scheduled) dosage and an 'optional/as needed' dosage. My dataset currently consists of 397 different patients, 1520 different dates and 161 different drugs. Only 21790 of the 397*1520*161*2 possible entries are non-null (i.e., roughly 0.01%).
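For reference, the sparsity figure can be checked with a few lines of Python. The counts are the ones quoted above; the 8 bytes per cell is an assumption (float64 storage), not something measured on the real array.

```python
# Counts quoted in the question
n_patients, n_dates, n_drugs, n_timings = 397, 1520, 161, 2
n_filled = 21790  # non-null entries

# Number of cells a dense 4-D array would hold
n_cells = n_patients * n_dates * n_drugs * n_timings

# Fraction of cells that actually carry a value
density = n_filled / n_cells

# Dense footprint, ASSUMING 8 bytes per cell (float64)
dense_gib = n_cells * 8 / 2**30

print(f"cells: {n_cells}")
print(f"density: {density:.4%}")
print(f"dense size: {dense_gib:.2f} GiB")
```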

Initial code

My data is currently organized as the following xarray.DataArray:

drugs = xarray.DataArray(dosages, coords={'patient': patients, 'time': dates, 
                                          'drug': drug_names, 'timing': timings, 
                                          'drug_type': ('drug', drug_types), 
                                          'drug_class': ('drug', drug_classes)},
                         dims=['patient', 'time', 'drug', 'timing'])

where dosages.shape = (len(patients), len(dates), len(drug_names), 2). The timing axis corresponds to 'scheduled' vs. 'as needed' dosage. All the missing/zero entries are set to numpy.nan.

Assuming ~ 8 bytes per item, this would only be ~ 1.5 GB (e.g. `397*1520*161*2*8 / 2 ** 30`). This should fit comfortably in memory on almost any home PC, laptop, etc., and should be trivial on work / research / academic computing resources. Is memory the concern, or do you also want a suite of relational data algorithms (like the pandas API) but with native implementation for a sparse representation? — ely, Apr 05 '18 at 15:14
You might take a look at [h5py](https://www.h5py.org) and [PyTables](https://www.pytables.org); there's a good comparison [here](https://stackoverflow.com/questions/7883646/exporting-from-importing-to-numpy-scipy-in-sqlite-and-hdf5-formats) on SO. — denis, Sep 18 '19 at 09:00

score 1 · Answer 1 · answered Apr 05 '18 at 15:05

Currently (as of version 0.10.2) xarray supports only dense arrays, but there is a GitHub issue (https://github.com/pydata/xarray/issues/1375) requesting sparse array support. A quick look at that issue suggests the work is active, and that the approach is to let xarray support arrays from the sparse module.
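To illustrate why sparse support would matter here: a coordinate (COO) style layout stores only the non-null cells together with their coordinates, so memory grows with the number of filled entries rather than with the product of the axis lengths. The sketch below shows that idea in plain Python only. It is not the API of xarray or of the sparse module, and the records and the `dosage` helper are made up for the example.

```python
import math
from collections import defaultdict

# Hypothetical sample records: (patient, date, drug, timing, dosage)
records = [
    (1, "2018-01-01", "drug_a", "scheduled", 10.0),
    (1, "2018-01-01", "drug_a", "as_needed", 5.0),
    (2, "2018-01-03", "drug_b", "scheduled", 2.5),
]

# Coordinate-style storage: keep only the filled cells, keyed by their coordinates
cells = {(p, d, drug, timing): dose for p, d, drug, timing, dose in records}


def dosage(patient, date, drug, timing):
    """Return the stored dosage, or NaN for a cell that was never filled."""
    return cells.get((patient, date, drug, timing), math.nan)


# Reduce over every axis except 'patient' by iterating the filled cells only
total_by_patient = defaultdict(float)
for (patient, _date, _drug, _timing), dose in cells.items():
    total_by_patient[patient] += dose

print(dosage(1, "2018-01-01", "drug_a", "scheduled"))
print(dict(total_by_patient))
```

Lookups and reductions touch only the stored entries, which is the property a sparse backend would give; what this toy version lacks is the labelled indexing and vectorised arithmetic that make xarray convenient.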
