0

I'm having trouble understanding pandas reindex. I have a series of measurements, munged into a multi-index df, and I'd like to reindex and interpolate those measurements to align them with some other data.

My actual data has ~7 index levels and several different measurements. I hope the solution for this toy data problem is applicable to my real data. It's "small data"; each individual measurement is a couple KB.

Here's a pair of toy problems, one which shows the expected behavior and one which doesn't seem to do anything.

Single-level index, works as expected:

"""
step,value
1,1
3,2
5,1
"""
df_i = pd.read_clipboard(sep=",").set_index("step")
print(df_i)

new_index = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

df_i = df_i.reindex(new_index).interpolate()
print(df_i)

Outputs, the original df and the re-indexed and interpolated one:

      value
step       
1         1
3         2
5         1
      value
step       
1       1.0
2       1.5
3       2.0
4       1.5
5       1.0
6       1.0
7       1.0
8       1.0
9       1.0

Works great.

Multi-index, currently not working:

"""
sample,meas_id,step,value
1,1,1,1
1,1,3,2
1,1,5,1
1,2,3,2
1,2,5,2
1,2,7,1
1,2,9,0
"""
df_mi = pd.read_clipboard(sep=",").set_index(["sample", "meas_id", "step"])
print(df_mi)

df_mi = df_mi.reindex(new_index, level="step").interpolate()
print(df_mi)

Output, unchanged after reindex (and therefore after interpolate):

                     value
sample meas_id step       
1      1       1         1
               3         2
               5         1
       2       3         2
               5         2
               7         1
               9         0


                     value
sample meas_id step       
1      1       1         1
               3         2
               5         1
       2       3         2
               5         2
               7         1
               9         0

How do I actually reindex a column in a multi-index df?

Here's the output I'd like, assuming linear interpolation:

                     value
sample meas_id step       
1      1       1         1
               2       1.5
               3         2
               5         1
               6         1
               7         1
               8         1
               9         1
       2       1       NaN (or 2)
               2       NaN (or 2)
               3         2
               4         2
               5         2
               6       1.5
               7         1
               8       0.5
               9         0

I spent some sincere time looking over SO, and if the answer is in there, I missed it:

Fill multi-index Pandas DataFrame with interpolation

Resampling Within a Pandas MultiIndex

pandas multiindex dataframe, ND interpolation for missing values

Fill multi-index Pandas DataFrame with interpolation

https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing

Possibly related GitHub issues:

https://github.com/numpy/numpy/issues/11975

https://github.com/pandas-dev/pandas/issues/23104

https://github.com/pandas-dev/pandas/issues/17132

Evan
  • 2,121
  • 14
  • 27

1 Answers1

1

IIUC create the index by using MultiIndex.from_product, then just do reindex

idx=pd.MultiIndex.from_product([df_mi.index.levels[0],df_mi.index.levels[1],new_index])    
df_mi.reindex(idx).interpolate()
Out[161]: 
          value
1 1 1  1.000000
    2  1.500000
    3  2.000000
    4  1.500000
    5  1.000000
    6  1.142857
    7  1.285714
    8  1.428571
    9  1.571429
  2 1  1.714286 # here is bad , it take previous value into consideration 
    2  1.857143
    3  2.000000
    4  2.000000
    5  2.000000
    6  1.500000
    7  1.000000
    8  0.500000
    9  0.000000

My think

def idx(x):
    idx = pd.MultiIndex.from_product([x.index.get_level_values(0).unique(), x.index.get_level_values(1).unique(), new_index])
    return idx



pd.concat([y.reindex(idx(y)).interpolate() for _,y in df_mi.groupby(level=[0,1])])

       value
1 1 1    1.0
    2    1.5
    3    2.0
    4    1.5
    5    1.0
    6    1.0
    7    1.0
    8    1.0
    9    1.0
  2 1    NaN
    2    NaN
    3    2.0
    4    2.0
    5    2.0
    6    1.5
    7    1.0
    8    0.5
    9    0.0
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Thanks - you're right that that interpolation is bad. For my data, I can't have that kind of "crosstalk" if other index value have changed (here, `meas_id` changes from 1 to 2.) – Evan Dec 14 '18 at 23:05
  • @Evan check the my method . :-) hope it solve the problem , and also interpolate will not handle ffill problem , that is why you have NaN in second similar with single index , reindex output – BENY Dec 14 '18 at 23:10
  • It certainly solves the toy problem! I will try to apply it to my real problem and follow up with any issues there. Thank you! – Evan Dec 14 '18 at 23:22
  • For sure. It just seems like `reindex` is still not working as it should, and your workaround (which is excellent) shouldn't be necessary. I'm also concerned as to how it will scale, but kludgy adaptations are par for the course for pandas... – Evan Dec 15 '18 at 01:01