
I am a scientist recently converted from MATLAB to Python. I am looking for ways to structure my (mainly 2D and 3D) datasets. I have searched the net quite a bit, and it seems to me that robust and general-purpose data structuring in Python is still somewhat up in the air. I think this question and any answers will be highly relevant for other Python scientists looking for a way to structure data in a way that allows focusing on the problems at hand rather than the underlying implementation.

One example of the structure of my data is time x altitude x parameter, where parameter is e.g. density, temperature, etc. For the time dimension, I would like to use datetime objects, since this seems very robust and facilitates easy conversion, formatting, etc.
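
As a rough illustration of that layout (a sketch only: the axis values, the km unit, and all variable names below are made up for the example), the data could be held in a plain NumPy array with the coordinates kept in separate lists:

```python
from datetime import datetime, timedelta

import numpy as np

# Hypothetical coordinate axes (values invented for illustration)
times = [datetime(2013, 11, 22) + timedelta(hours=6 * i) for i in range(4)]
altitudes = np.array([100.0, 200.0, 300.0])  # assumed to be in km
parameters = ["density", "temperature"]

# 3D array indexed as time x altitude x parameter
data = np.zeros((len(times), len(altitudes), len(parameters)))
data[:, :, parameters.index("temperature")] = 250.0  # dummy values

# Select temperature between 200 and 300 km from 12:00 onwards
t_mask = np.array([t >= datetime(2013, 11, 22, 12) for t in times])
alt_mask = (altitudes >= 200) & (altitudes <= 300)
subset = data[t_mask][:, alt_mask, parameters.index("temperature")]
```

This works, but the bookkeeping (which axis is which, which index is which parameter) has to be done by hand, which is exactly what I would like a data type to take care of.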

So far, I've looked into Pandas and MetaArray (from the SciPy cookbook).

Pandas' main drawback as a data type is that it's much more than just that. Each dimension in e.g. a Panel (items, major axis, minor axis) seems to have certain preferred uses, though I cannot figure out which. The indexing in particular differs depending on the dimension, and some dimensions cannot be expanded after the data structure has been created. Thus, even though some of Pandas' functions, like grouping (.groupby), are really useful for a small part of my work, Pandas is not really intuitive for interactive scientific work, and I find myself looking for other options as my day-to-day data type.

I have also looked briefly into MetaArray from the SciPy cookbook. This looks more like a clean-cut data type, and the indexing seems really intuitive and flexible, making it much more suited to interactive scientific work. However, it is not (AFAIK) part of any package, and needs to be downloaded and installed manually, which makes portability more difficult if I need to collaborate with other scientists. Also, I find almost no examples of it being in use, and thus it seems rather like an ad-hoc solution to the problem of structuring N-dimensional datasets.

I have also heard of Blaze, billed as the "next generation of NumPy", but as far as I can see it is still very much in early development. (Experiences with Blaze are welcome!)

Thus, I would like some examples (modules, packages, etc.) of how N-dimensional datasets (in particular 3D) may be structured in Python, most importantly in order to easily facilitate interactive use.

cmeeren
  • I would say just stick with Pandas until it becomes more familiar to you. – YXD Nov 21 '13 at 12:40
  • Agreed about pandas. You can always access the underlying numpy N-dimensional arrays with the `.values` attributes if the indices are meaningless, while still getting all the other advantages of pandas. With 3-D datasets you've got two main options [panel](http://pandas.pydata.org/pandas-docs/dev/dsintro.html#panel) or a DataFrame with a [MultiIndex](http://pandas.pydata.org/pandas-docs/dev/indexing.html#hierarchical-indexing-multiindex). Panels have some missing functionality at this point in time. – TomAugspurger Nov 21 '13 at 13:57
  • How are you handling this data in Matlab? – hpaulj Nov 21 '13 at 21:10
  • @hpaulj just a simple 3D matrix. Which works fairly well, except it's somewhat cumbersome to do interactive work since you constantly have to remember which array dimensions correspond to which axes, and which parameters correspond to which indices along a given axis, etc. Not really a big problem, but I like the idea of being able to select data as e.g. `data['density', 'altitude':200:400, 'time':'20131122':'20131124']` (example is MetaArray-like syntax, with "partial date string" indexing as in Pandas) – cmeeren Nov 22 '13 at 07:40
  • http://docs.scipy.org/doc/scipy/reference/tutorial/io.html is a SciPy tutorial on reading Matlab .mat files. It may give ideas on how to relate Matlab structures to numpy ones. NetCDF and HDF5 are other scientifically oriented file structures that can be handled in numpy/scipy. – hpaulj Nov 22 '13 at 08:17
  • Thanks. The problem however is not relating MATLAB structures to numpy ones, but finding a data format (at least 3D) suitable for intuitive interactive work. – cmeeren Nov 22 '13 at 08:35
  • I have also looked into N-dimensional data types in Python for research work in machine learning, and just like you I couldn't find any really suitable solution. I think the best bet is to stick with Pandas + Numpy (and in the future Blaze); Pandas is already making a lot of enhancements in this area with Panels. Also, you can try to reduce the dimensionality of your data using Pandas indexes: try to make some of the features a subindex of another feature (for example time > density > altitude). This way you can turn any N-dimensional data into hierarchical 2D or even 1D data. – gaborous Dec 27 '13 at 17:17
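
The MultiIndex approach suggested in the comments above might look roughly like the following. This is only a sketch under assumptions: the dates, altitudes, random values, and variable names are invented for illustration, and explicit `pd.Timestamp` bounds are used for the time slice to keep the selection unambiguous.

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates (invented for illustration)
times = pd.date_range("2013-11-21", periods=5, freq="D")
altitudes = [100, 200, 300, 400, 500]

# Rows are (time, altitude) pairs; each parameter becomes a column
index = pd.MultiIndex.from_product(
    [times, altitudes], names=["time", "altitude"]
)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "density": rng.random(len(index)),
        "temperature": rng.random(len(index)),
    },
    index=index,
)

# Select density for 22-24 Nov at altitudes 200-400 (label slices are inclusive)
idx = pd.IndexSlice
selection = df.loc[
    idx[pd.Timestamp("2013-11-22"):pd.Timestamp("2013-11-24"), 200:400],
    "density",
]

# Recover a plain 2D time x altitude array for one parameter
grid = df["density"].unstack("altitude").to_numpy()
```

Here the third dimension (parameter) is spread across the columns, so the 3D dataset is stored as hierarchical 2D data, and `unstack` gives back an ordinary array when the labels are no longer needed.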

0 Answers