
I've created a pandas DataFrame by reading a file with scipy.io in the following way (file.sav is an IDL structure created on a different machine; scipy.io's readsav returns a standard Python dictionary):

from scipy import io
import pandas as pd
import numpy as np
tmp = io.readsav('file.sav', python_dict=True)
df = pd.DataFrame(tmp, index=tmp['shots'].astype('int32'))

The DataFrame contains a set of values (from file.sav) and, as indices, a series of integers of the form 19999, 20000, 30000, etc. Now I would like to take a subset of these indices, say

df.loc[[19999,20000]]

For some reason I get errors of the form

raise ValueError('Cannot index with multidimensional key')

plus others, and at the end

ValueError: Big-endian buffer not supported on little-endian compiler

But I've checked that both the machine I'm working on and the machine which has created the file.sav are both little endian. So I don't think this is the problem.
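A quick way to tell the two things apart: the byte order of the *data in the file* is independent of either machine's CPU byte order, and NumPy records it in the dtype. A minimal sketch (the `'>f4'` array stands in for what readsav can return; `dtype.newbyteorder()` is a real NumPy call and works on both NumPy 1.x and 2.x, unlike the removed `ndarray.newbyteorder()`):

```python
import numpy as np

# '>f4' is big-endian float32, regardless of the machine's own order.
arr = np.arange(3, dtype='>f4')
print(arr.dtype.byteorder)   # '>' marks big-endian data

# Flip to native order: swap the bytes, then reinterpret with the
# flipped dtype so the values come out unchanged.
native = arr.byteswap().view(arr.dtype.newbyteorder())
print(native.dtype.byteorder)
```

So both machines being little-endian does not rule out the file's arrays being stored big-endian.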

Paul H
Nicola Vianello
  • Can you post file.sav somewhere where we can try it? Or, better yet, a small section of file.sav that reproduces the error? – Dan Allan Sep 03 '13 at 19:05
  • does `df.loc[19999:20001]` work? do you really have a multi-index (meaning an index comprised of several columns)? – Paul H Sep 03 '13 at 19:09
  • I made a dummy file.sav available here http://db.tt/lKu7Jcsg. You can try on your self. Now shots is between 20000 and 20099. Actually the system suggested by Paul H works but the problem is that I would like to use indices which are not consecutive. Maybe the name of the question is incorrect. Actually I would like to take a subset of rows of the dataframe – Nicola Vianello Sep 03 '13 at 19:21
  • Can itemgetter work for this? Surely there is a simple way without using a lambda. – Chogg Aug 09 '17 at 22:40

2 Answers


Your input file is big-endian. See here for how to convert it: http://pandas.pydata.org/pandas-docs/dev/gotchas.html#byte-ordering-issues

Compare before and after

In [7]: df.dtypes
Out[7]: 
a        >f4
b        >f4
c        >f4
shots    >f4
dtype: object

In [9]: df.apply(lambda x: x.values.byteswap().newbyteorder())
Out[9]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 20000 to 20099
Data columns (total 4 columns):
a        100  non-null values
b        100  non-null values
c        100  non-null values
shots    100  non-null values
dtypes: float32(4)

In [10]: df.apply(lambda x: x.values.byteswap().newbyteorder()).dtypes
Out[10]: 
a        float32
b        float32
c        float32
shots    float32
dtype: object

Also set the index AFTER you do this (i.e. don't do it in the constructor):

df.set_index('shots',inplace=True)
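Putting the pieces together, a self-contained sketch of the whole fix (the `tmp` dict is a toy stand-in for readsav's output; the conversion uses `byteswap().view(dtype.newbyteorder())`, which is equivalent to the `apply` above but also works on NumPy 2.x, where `ndarray.newbyteorder()` was removed):

```python
import numpy as np
import pandas as pd

# Stand-in for readsav's output: a dict of big-endian ('>f4') arrays.
tmp = {
    'a': np.arange(3, dtype='>f4'),
    'shots': np.array([19999, 20000, 20001], dtype='>f4'),
}

# Convert each array to native byte order: byteswap the raw bytes,
# then reinterpret them with the flipped dtype.
native = {k: v.byteswap().view(v.dtype.newbyteorder()) for k, v in tmp.items()}

# Build the frame first, THEN set the index, as suggested above.
df = pd.DataFrame(native)
df = df.set_index(df['shots'].astype('int32'))

print(df.loc[[19999, 20000]])
```

After the conversion, the list-based `.loc` lookup from the question works as expected.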
Jeff
  • Actually you are right. I have checked the type of endian on both the machines as sys.byteorder with the same answer as little. So I thought this was not the problem. Actually do you have an idea of the more pythonic way to convert the tmp (which is a python dictionary) to little endian? – Nicola Vianello Sep 03 '13 at 21:13
  • you can do what I do in the apply; it's operating on the numpy arrays directly. I don't think readsav can do this conversion. – Jeff Sep 03 '13 at 21:41
  • Ok I found a way. Actually it was a little tricky (at least for me) to preserve the fact that tmp is a dictionary. But I found a workaround (maybe not the best one). Thanks a lot – Nicola Vianello Sep 03 '13 at 21:50
  • ok... you can do the conversion after it's a frame in any event (and we may introduce a method to do this directly in 0.13) – Jeff Sep 03 '13 at 22:04
  • If the Dataframe contain non-numeric data types, you can use [this](http://stackoverflow.com/a/34530065/1461850). – Lee Feb 26 '16 at 11:44

From your comments, I would approach the problem in the following way:

values_i_want = [19999, 20000, 20005, 20007]
subset = df.select(lambda x: x in values_i_want)

(With a flat integer index like yours, select passes each label directly, so test x itself rather than x[0]; x[0] would only apply to a MultiIndex.)

If your DataFrame is very large (it sounds like it is), the select method will probably be pretty slow. In that case, another approach would be to loop through values_i_want taking cross sections (`df.xs(val, level=0)`) and appending them to an output DataFrame. In other words (untested):

for n, val in enumerate(values_i_want):
    if n == 0:
        subset = df.xs(val, level=0)
    else:
        subset = subset.append(df.xs(val, level=0))

Not sure if that'll be any faster. But it's worth trying if the select approach is too slow.
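On current pandas, where DataFrame.select has been removed (it was deprecated in 0.21), the same subset can be taken with a boolean mask built from `Index.isin` — a small sketch with a toy frame standing in for the shots-indexed data:

```python
import pandas as pd

# Toy frame with a non-consecutive integer index, like the shots
# index from the question.
df = pd.DataFrame({'a': [10, 20, 30, 40, 50]},
                  index=[19999, 20000, 20005, 20007, 30000])

values_i_want = [19999, 20000, 20005, 20007]

# Boolean mask over the index; keeps only the rows whose label is
# in values_i_want, and never raises on labels that are absent.
subset = df[df.index.isin(values_i_want)]
print(subset)
```

Plain `df.loc[values_i_want]` also works here, but it raises a KeyError if any requested label is missing, whereas the isin mask silently skips missing labels.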

Paul H
  • Seems the `select` method is [deprecated](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.select.html). – Quinn Culver Nov 12 '19 at 17:44