0

I have a large pandas dataframe, with time series data and a rather large multiindex. Said index contains various information about the time series, such as for example location, datatype and so on.

Now I want to add a new row to the index, with an integer (or float, doesnt really matter), containing a distance to a certain point. Following that, I want to sort the dataframe by this distance.

I am not sure how to add a new index level, and how to assign new values by hand. Also, can pandas even sort columns after an random number in one of its index levels?

Example

(code from here )

header=pd.MultiIndex.from_product([['location1','location2'],['S1','S2','S3']],names=['loc','S'])
df = pd.DataFrame(np.random.randn(5, 6), index=['a','b','c','d','e'], columns = header)

Looks like this:

loc  location1                      location2                    
S           S1        S2        S3         S1        S2        S3
a     1.530590  0.536364  1.295848   0.422256 -1.853786  1.334981
b     0.275857 -0.848685 -1.212584  -0.464235 -0.855600  0.680985
c    -1.209607  0.265359 -0.695233   0.643896  1.315216 -0.751027
d    -1.591613 -0.178605  0.878567   0.647389 -0.454313 -1.972509
e     1.098193 -0.766810  0.087173   0.714301 -0.886545 -0.826163

What I want to to, is in a first step, add some distances to each column, like location1 S1 add dist 200, location1 S2 add dist 760 and so on, resulting in this:

loc  location1                      location2                    
S           S1        S2        S3         S1        S2        S3
dist       200       760        10       1000       340        70
a     1.530590  0.536364  1.295848   0.422256 -1.853786  1.334981
b     0.275857 -0.848685 -1.212584  -0.464235 -0.855600  0.680985
c    -1.209607  0.265359 -0.695233   0.643896  1.315216 -0.751027
d    -1.591613 -0.178605  0.878567   0.647389 -0.454313 -1.972509
e     1.098193 -0.766810  0.087173   0.714301 -0.886545 -0.826163

And then do something like df.sortlevel('dist'), resulting in

loc location1 location2 location1 location2 location1 location2
S          S3        S3        S1        S2        S2        S1
dist       10        70       200       340       760      1000
a    1.295848  1.334981  1.530590 -1.853786  0.536364  0.422256
b   -1.212584  0.680985  0.275857 -0.855600 -0.848685 -0.464235
…

So that have the whole thing sorted by distance, for things like plt.matshow(df.corr()).

Can pandas even sort a df after a random index with an integer? Because I have another dataframe, that already has an integer in its multindex, and here some_otherdf.sortlevel('HZB') results in TypeError: can only sort by level with a hierarchical index

Edit:

As of now, there is two answers, both of which work perfectly fine for my test case. I think @Pedro M Duarte's answer might be the more correct one, given that it uses the multiindex as intended. However, for my real data, it would require either a lot of reworking, or a lot of typing, for a 7 levels deep multiindex and 50 data series, which is very error prone. @Nader Hisham ignored my request to stay in my multiindex, but it requires just the quick, easy and easy to check writing of a simple row of numbers (saving me a lot of time), that I can then remove after sorting. for other folks with a similar question, it might be different.

Community
  • 1
  • 1
JC_CL
  • 2,346
  • 6
  • 23
  • 36

2 Answers2

2
In[1]:
import pandas as pd
import numpy as np

header=pd.MultiIndex.from_product(
    [['location1','location2'],['S1','S2','S3']],
    names=['loc','S'])

df = pd.DataFrame(np.random.randn(5, 6), 
                  index=['a','b','c','d','e'], columns = header)

print(df)

Out[1]:
    loc location1                     location2                    
    S          S1        S2        S3        S1        S2        S3
    a    0.503357 -0.461202 -1.412865  0.866237  1.290292  0.635869
    b   -0.904658 -1.190422 -0.198654 -0.916884 -1.070291 -1.918091
    c   -1.448068 -0.121475 -0.838693  0.047861 -0.131904  1.154370
    d    1.758752 -0.094962 -2.035204 -0.399195 -0.756726  1.609393
    e    0.421521  1.134518 -0.809148 -0.543523 -1.161328  1.261901



In[2]:
distances = {
    ('location1','S1'): 200,
    ('location1','S2'): 760,
    ('location1','S3'): 10,
    ('location2','S1'): 1000,
    ('location2','S2'): 340,
    ('location2','S3'): 70,
}

index = df.columns
df.columns = pd.MultiIndex.from_tuples(
    [(key[0], key[1], distances[key],) for key in index.get_values()],
    names=[index.get_level_values(0).name,
           index.get_level_values(1).name,
           'dist']
)
print(df)

Out[2]:
    loc  location1                     location2                    
    S           S1        S2        S3        S1        S2        S3
    dist      200       760       10        1000      340       70  
    a     0.503357 -0.461202 -1.412865  0.866237  1.290292  0.635869
    b    -0.904658 -1.190422 -0.198654 -0.916884 -1.070291 -1.918091
    c    -1.448068 -0.121475 -0.838693  0.047861 -0.131904  1.154370
    d     1.758752 -0.094962 -2.035204 -0.399195 -0.756726  1.609393
    e     0.421521  1.134518 -0.809148 -0.543523 -1.161328  1.261901



In[3]:
result = df.sortlevel(level=2, axis=1)
print(result)

Out[3]:
    loc  location1 location2 location1 location2 location1 location2
    S           S3        S3        S1        S2        S2        S1
    dist      10        70        200       340       760       1000
    a    -1.412865  0.635869  0.503357  1.290292 -0.461202  0.866237
    b    -0.198654 -1.918091 -0.904658 -1.070291 -1.190422 -0.916884
    c    -0.838693  1.154370 -1.448068 -0.131904 -0.121475  0.047861
    d    -2.035204  1.609393  1.758752 -0.756726 -0.094962 -0.399195
    e    -0.809148  1.261901  0.421521 -1.161328  1.134518 -0.543523
Pedro M Duarte
  • 26,823
  • 7
  • 44
  • 43
  • Looks good. I have to play around a bit with it, to see if I can adapt it to my real data. Can you please elaborate a bit on the `df.columns = pd.MultiIndex.from_tuples( [(key[0], key[1], distances[key],) for key in index.get_values()], names=[index.get_level_values(0).name, index.get_level_values(1).name, 'dist'] )` part? I am not 100% sure what exactly it does. – JC_CL Nov 15 '15 at 17:31
  • The point of that line is to create the `MultiIndex` from scratch, keeping the level values from the original `index` and adding a new level for `dist`. I am using the `from_tuples` constructor for the `MultiIndex`. You have to pass it a list of tuples, where each tuple contains info for an entry in the index. You also pass it `names`, which will be used to set the level names. In this case the desired `MultiIndex` has three levels, therefore each tuple in the list of tuples has `len==3`, just like the `names` list. – Pedro M Duarte Nov 15 '15 at 18:49
  • To create the list of tuples needed in `from_tuples` I used a list comprehension, iterating over `index.get_values()`. Here `index.get_values()` gives us a handy representation of the original `MultiIndex` as a list of `tuples`. Since the original `MultiIndex` has two levels, I can access the values via `key[0]` and `key[1]`. – Pedro M Duarte Nov 15 '15 at 18:55
1
In [35]:
df.loc['dist' , : ] = [200,760,10,1000,340,70]
df
Out[35]:
loc                location1                 location2
S     S1              S2            S3            S1          S2    S3
a     0.348766  -0.326088   -0.891929   -0.704856   -1.514304   0.611692
b    -0.546026  -0.111232   -1.022104   -1.246002   0.328385    0.576465
c    -0.743512  -0.362791   -0.617021   -0.859157   -0.300368   0.292980
d     0.090178  1.369648    0.171753    -0.411466   0.478654    1.814878
e    -0.380414  -1.568492   -0.432858   1.034861    -0.633563   1.403627
dist 200.000000 760.000000  10.000000   1000.000000 340.000000  70.000000


In [36]:
order = np.argsort(df.loc['dist' , :]).values
order
Out[36]:
array([2, 5, 0, 4, 1, 3], dtype=int64)

In [37]:

df.iloc[: , order]
Out[37]:
loc    location1    location2   location1   location2   location1   location2
S            S3      S3            S1         S2          S2           S1
a     -0.891929    0.611692     0.348766    -1.514304   -0.326088   -0.704856
b     -1.022104    0.576465    -0.546026    0.328385    -0.111232   -1.246002
c     -0.617021    0.292980    -0.743512    -0.300368   -0.362791   -0.859157
d     0.171753     1.814878     0.090178    0.478654    1.369648    -0.411466
e     -0.432858    1.403627     -0.380414   -0.633563   -1.568492   1.034861
dist  10.000000    70.000000    200.000000  340.000000  760.000000  1000.000000

if you want to make your dist index as the first index you can do the following

Nader Hisham
  • 5,214
  • 4
  • 19
  • 35
  • You are just adding a new `dist` row to the data part, of the dataframe, but not to the index. I am not sure if that will not cause me issues with stuff like plotting the series. Will play wwith it a bit. – JC_CL Nov 15 '15 at 17:35
  • Yes, but it is an index like `a`, `b` and so on, instead of being part of the multiindex such as `loc` and `S`. When I do `df.plot()`, It also shows the `dist` part. – JC_CL Nov 15 '15 at 17:45
  • OK, i can just do your way, and after the sorting, delete the line `droppeddf = dfordered.drop('dist')` to get a proper plot. Seems a bit hacky, but maybe faster than Pedro M Duarte's solution. I will play with both for bit and see which works best for me. – JC_CL Nov 15 '15 at 17:54
  • I'll point out that I was shooting for generality rather than speed. Notice that I used a dictionary to obtain the value of `dist` from the values of `loc` and `S`. In your application perhaps you have a function that calculates `dist`, you could just plug in that function instead of the dictionary that I have used. – Pedro M Duarte Nov 15 '15 at 19:10