I want to structure my data (similarly to pandas) to allow easy data exploration. I tried using xarray.DataArray for this task (the recommended way to represent n-dimensional data in pandas: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#panel4d-and-panelnd-deprecated), but it appears inefficient given that my data is sparse. Is there a better way to structure my data under xarray.DataArray or under another Python data structure to allow easy data exploration?
Description of data
My data consists of prescriptions given to patients. Each entry consists of:
- Date (datetime64)
- Patient Id (int)
- Drug name (string)
- Drug type (string)
- Drug class (string)
- Scheduled dosage (real value)
- Dosage as needed (real value)
There may be several prescriptions on the same date for different patients. A patient may also be prescribed several drugs (e.g., 2-3 drugs) at the same time, each with a 'mandatory' (scheduled) dosage and an 'optional/as needed' dosage. My dataset currently consists of 397 different patients, 1520 different dates and 161 different drugs. Only 21790 of the 397*1520*161*2 possible entries are non-null (i.e., about 0.01%).
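As a quick sanity check of the sparsity figure, the fill ratio can be computed directly from the counts quoted above. The variable names are illustrative only and do not come from my actual code.

```python
# Counts quoted in the description above.
n_patients = 397
n_dates = 1520
n_drugs = 161
n_timings = 2  # 'scheduled' vs. 'as needed'
n_non_null = 21790

# Total number of cells in the dense 4-D layout.
n_cells = n_patients * n_dates * n_drugs * n_timings

# Fraction of cells that actually hold a dosage.
fill_ratio = n_non_null / n_cells

print(f"{n_cells} cells, {fill_ratio:.4%} filled")
```

This gives roughly one filled cell per ten thousand, which is where the 0.01% comes from.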
Initial code
My data is currently organized as the following xarray.DataArray:
import xarray

drugs = xarray.DataArray(
    dosages,
    coords={'patient': patients, 'time': dates,
            'drug': drug_names, 'timing': timings,
            'drug_type': ('drug', drug_types),
            'drug_class': ('drug', drug_classes)},
    dims=['patient', 'time', 'drug', 'timing'])
where dosages.shape = (len(patients), len(dates), len(drug_names), 2). The timing axis corresponds to 'scheduled' vs. 'as needed' dosage. All the missing/zero entries are set to numpy.nan.
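To make the inefficiency concrete, here is a rough estimate of the memory the dense layout needs. It assumes the dosages are stored as float64, which is an assumption on my part (it is NumPy's default float dtype). The size is computed from the shape alone, without allocating the array.

```python
import math

import numpy as np

# Shape of the dense array described above: (patient, time, drug, timing).
shape = (397, 1520, 161, 2)

# Assumption: dosages are stored as float64, NumPy's default float dtype.
itemsize = np.dtype(np.float64).itemsize  # bytes per cell

# Memory needed by the dense array, computed without allocating it.
dense_bytes = math.prod(shape) * itemsize

# Memory needed if only the 21790 non-null values were stored
# (values only, ignoring any index overhead).
non_null_bytes = 21790 * itemsize

print(f"dense:    {dense_bytes / 1e9:.2f} GB")
print(f"non-null: {non_null_bytes / 1e6:.2f} MB")
```

Almost all of the dense array's memory is spent on NaN placeholders, which is why I am looking for a sparser representation.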