Suppose I have a collection of objects which I wish to save in Python, say, lists of numbers: [0.12, 0.85, 0.11, 0.12], [0.23, 0.52, 0.10, 0.19], etc. Suppose further that these objects are indexed by three attributes, say, "origin", "destination", and "month". I wish to store these objects in an array-like object which can be easily sliced, ideally using either a numerical index or a name.

For example,

obj[2,1,7] # might return: [0.23, 0.52, 0.10, 0.19]

Or,

obj['chicago','new york','jan'] # might return: [0.12, 0.85, 0.11, 0.12]

And further,

obj[:,'new york','jan'] # would return data with first index = any.

I'm looking for the best practice to achieve this in Python. I did find this post, which seems quite suitable, but it seemed to require some overhead and there was little discussion of alternatives. I also found something called the xarray package, though this doesn't seem as popular. I am transitioning from R, where I would do this with the array() function, which adds a multi-dimensional index to any vector-like structure.

Zhaochen He
  • I'm pretty sure you can do what you want with pandas, but accessing the values will be slightly more complicated than what you wrote – Novice Aug 16 '19 at 07:44
  • pandas can do it. Otherwise have a look at structured numpy arrays: https://docs.scipy.org/doc/numpy/user/basics.rec.html – DZurico Aug 16 '19 at 07:46
  • `pandas` is specialized for 2-dimensional 'tabular' data, though it does support hierarchical indices. if you want true multi-dimensional labeled indices use `xarray` – juanpa.arrivillaga Aug 16 '19 at 08:05
  • What is it that you want that pandas can't do? Or are you unfamiliar with pandas? – Novice Aug 16 '19 at 08:05
  • @Novice they want more than two-dimensions. `pandas` deprecated the 3-D [`Panel`](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Panel.html) and the docs even suggest `xarray` – juanpa.arrivillaga Aug 16 '19 at 08:08
  • I'm not very familiar with pandas, but it seemed to me that pandas is similar to the dataframe structure in R, in which things are stored "long". That is, I might store data like: (chicago, new york, 3.2); (chicago, boston, 6.7); (chicago, miami, 1.1), etc. So if I wanted all the data for chicago, I would need to use a filter operation to select all of the appropriate rows. This is OK, but it would be nice to have the data "folded" along its dimensions. – Zhaochen He Aug 16 '19 at 09:00
  • @Zhaochen He, yea, that's what I meant by what pandas could do. Just filtering options instead of indexing. You might be able to make pandas work smoother with multi-indexing but if you have too many variables it might not work nicely – Novice Aug 16 '19 at 11:18
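
The multi-indexing approach mentioned in the comments above could look roughly like the sketch below. It assumes each object is a list of four numbers stored as one row, with hypothetical column names v0 to v3, and it uses random placeholder data. Note that selection is by label here, not by bare integer position.

```python
import numpy as np
import pandas as pd

cities = ["chicago", "new york", "boston"]
months = ["jan", "feb", "mar", "apr"]

# One row per (origin, destination, month) combination.
index = pd.MultiIndex.from_product(
    [cities, cities, months], names=["origin", "destination", "month"]
)

# Placeholder data: four numbers per row; column names are hypothetical.
rng = np.random.default_rng(0)
values = pd.DataFrame(
    rng.random((len(index), 4)), index=index, columns=["v0", "v1", "v2", "v3"]
).sort_index()  # sorting the index keeps label lookups well-behaved

# A single object, addressed by all three labels.
row = values.loc[("chicago", "new york", "jan")]

# All origins for a fixed destination and month,
# similar in spirit to obj[:, 'new york', 'jan'].
any_origin = values.xs(("new york", "jan"), level=["destination", "month"])

print(row.to_numpy())
print(any_origin)
```

The result of `xs` is a smaller DataFrame indexed by the remaining level (origin), so this is still the "long" layout with a hierarchical index rather than a true N-dimensional array.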

1 Answer

After some poking around, it appears that xarray is suitable for my needs. Unfortunately, given my lack of experience, I can't speak to compatibility with other packages or performance.

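import numpy as np
import xarray as xr

# Coordinate labels for the three indexing attributes
cityOrig = ['chicago', 'new york', 'boston']
cityDest = ['chicago', 'new york', 'boston']
month = ['jan', 'feb', 'mar', 'apr']

# Axis 0 ('dat') holds the 4 numbers stored for each (orig, dest, month) cell
data = np.random.rand(4, 3, 3, 4)

myArray = xr.DataArray(data,
                       dims=['dat', 'orig', 'dest', 'month'],
                       coords={'orig': cityOrig, 'dest': cityDest, 'month': month})

# Positional indexing: orig=0 ('chicago'), dest=1 ('new york'), month=0 ('jan')
print(myArray[:, 0, 1, 0].data)
# e.g. [0.64  0.605 0.445 0.059]  (values vary between runs because the data is random)

# Label-based indexing of the same cell returns the same values
print(myArray.loc[:, 'chicago', 'new york', 'jan'].data)
# e.g. [0.64  0.605 0.445 0.059]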
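
The "first index = any" case from the question can also be written by dimension name, which avoids having to remember axis order. The following is a minimal sketch using xarray's `sel` (labels) and `isel` (integer positions). The array is rebuilt here so the snippet stands alone, and the seed and variable names are arbitrary choices for illustration.

```python
import numpy as np
import xarray as xr

cities = ["chicago", "new york", "boston"]
months = ["jan", "feb", "mar", "apr"]

# Seeded generator so repeated runs give the same numbers (seed is arbitrary).
rng = np.random.default_rng(0)
data = rng.random((4, 3, 3, 4))  # axes: (dat, orig, dest, month)

arr = xr.DataArray(
    data,
    dims=["dat", "orig", "dest", "month"],
    coords={"orig": cities, "dest": cities, "month": months},
)

# Select by label; dimensions that are not named are kept in full,
# so this is the analogue of obj[:, 'new york', 'jan'].
by_label = arr.sel(dest="new york", month="jan")

# The same selection by integer position.
by_position = arr.isel(dest=1, month=0)

print(by_label.dims)   # dimensions that remain after selection
print(by_label.shape)

# A single cell: the four numbers for chicago -> new york in jan.
cell = arr.sel(orig="chicago", dest="new york", month="jan")
print(cell.values)
```

Here `by_label` keeps the `dat` and `orig` dimensions, so it holds the four stored numbers for every origin city.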
Zhaochen He
  • I would also recommend xarray at this point. It is part of the pydata ecosystem (together with pandas), and is [recommended for multi-dimensional data by pandas](https://pandas.pydata.org/pandas-docs/version/0.20.1/dsintro.html#deprecate-panel). It is definitely less mature than pandas, but is under active development. – mschrimpf Aug 27 '19 at 15:56