
I have a dataframe fal_mean whose MultiIndex levels are 'lat', 'lon', and 'yr'. Below is just a snippet of the data. What I need to do is select data by the number of entries for the index 'yr' (years) for a given lat/lon combination. For example, I would like to select only the data for each lat and lon combination that has N or more entries (years of yearly data), i.e. where the number of 'yr' values for that combination is >= N.

lat lon yr        101      102        103        104     105     109                                                             
35  -75 1978   410.63   410.63   1231.890   4516.950    0.00    0.00   
        1979     0.00   406.05   4060.460   6090.690  406.05    0.00   
        1980  1441.89   720.94  11535.100  12256.050    0.00    0.00   
        1982   121.54     0.00    850.790      0.000    0.00    0.00   
        1983   638.82  2874.68  37272.955   1916.455    0.00  638.82   

I can select data based on the lat and lon combinations individually for whatever column I need:

fal_mean.loc[(35, -75), '101']
Out[74]: 
yr
1978     410.630000
1979       0.000000
1980    1441.890000
1982     121.540000
1983     638.820000
1984    5419.645000
1985      99.836667
1986    7113.757500
1987       0.000000
1988       0.000000
1997     408.410000
1998     251.690000
2001     208.040000
2002       0.000000
2004       0.000000
2005       0.000000
2006       0.000000
2007       0.000000
2009       0.000000
2010       0.000000
2011       0.000000
Name: 101, dtype: float64

and get the length of that selection:

len(fal_mean.loc[(35, -75), '101'])
Out[75]: 21

But I cannot figure out how to return a slice of the data frame using
len(fal_mean.loc[(35, -75), '101']) > N

something like:

fal_mean.loc[(len(fal_mean.loc[(35, -75), '101']) > 32)] 

Except I'd need to pass all values for the lat and lon indices.


EDIT 1

Here is a dataframe containing the lengths of each combination (index=lat, columns=lon), so I just need to be able to call those combinations >N with some slice notation on the original dataframe fal_mean.

I created the dataframe using:

cv=np.arange(35, 45, 1)
cv2=np.arange(-75, -66, 1)
fal=pd.DataFrame(data=None, index=cv, columns=cv2)
for lat in cv:
    for lon in cv2:
        fal.loc[lat, lon]=len(fal_mean.xs((lat, lon), level=('lat', 'lon')))

    -75 -74 -73 -72 -71 -70 -69 -68 -67
35  21  8   0   0   0   0   0   0   0
36  33  32  0   0   0   0   0   0   0
37  33  35  0   0   0   0   0   0   0
38  13  35  32  8   0   0   0   0   0
39  0   31  35  35  25  9   16  0   0
40  0   0   35  32  36  37  37  37  37
41  0   0   0   0   34  33  36  37  37
42  0   0   0   0   0   37  37  35  37
43  0   0   0   0   0   32  35  31  34
44  0   0   0   0   0   0   0   12  31
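(Incidentally, the same counts matrix can be built without the explicit double loop by counting group sizes and pivoting lon into columns. A minimal sketch on a toy frame with the same index structure; the data values are illustrative:)

```python
import pandas as pd

# Toy frame with the same three-level index: lat, lon, yr
idx = pd.MultiIndex.from_tuples(
    [(35, -75, 1978), (35, -75, 1979), (35, -74, 1994)],
    names=['lat', 'lon', 'yr'])
fal_mean = pd.DataFrame({'101': [410.63, 0.0, 908.0]}, index=idx)

# Count years per (lat, lon) group, then pivot lon into columns
fal = fal_mean.groupby(level=['lat', 'lon']).size().unstack(fill_value=0)
print(fal)
```

Here `.size()` returns one count per (lat, lon) pair and `.unstack()` moves the lon level to the columns, which is the same shape as the matrix above.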

I'm hoping there is some clever slice notation I could use to do this. I have tried xs and df.index.get_level_values but my syntax is wrong.

e.g. fal_mean.loc[:, len(fal_mean.index.get_level_values('yr') >32)] does not work.

If there isn't a way to do this in one line, I could call a boolean array on this matrix and hopefully use .loc to call the 0 and 1 indices (lat, lon) for those combinations in the original dataframe, but I'm not sure yet how to go about that.

fal >=32

    -75     -74     -73     -72     -71     -70     -69     -68     -67
35  False   False   False   False   False   False   False   False   False
36  True    True    False   False   False   False   False   False   False
37  True    True    False   False   False   False   False   False   False
38  False   True    True    False   False   False   False   False   False
39  False   False   True    True    False   False   False   False   False
40  False   False   True    True    True    True    True    True    True
41  False   False   False   False   True    True    True    True    True
42  False   False   False   False   False   True    True    True    True
43  False   False   False   False   False   True    True    False   True
44  False   False   False   False   False   False   False   False   False

EDIT 2: here are the first 50 rows and 4 columns; there are 28+ columns, but that is not important at this point. I would like to keep all columns, at only the latitude and longitude combinations with at least (N) entries for the year index - I'm thinking 32 may be sufficient. That would exclude data from lat/lon combinations (35, (all)), (36, -74), (38, -75), etc., using the dataframe above as a guide.

These data were generated by a groupby function grouped=nm.groupby(['lat', 'lon', 'yr']) so if I could modify that call with some limiter for the resulting lengths of grouped.year that would work also. I tried df.groupby(['key1', 'key2', 'key3']).size() and this gives the number of data points that were binned to each lat lon combination for each year.

                       101          102            103           104  
lat lon yr                                                             
35  -75 1978    410.630000   410.630000    1231.890000   4516.950000   
        1979      0.000000   406.050000    4060.460000   6090.690000   
        1980   1441.890000   720.940000   11535.100000  12256.050000   
        1982    121.540000     0.000000     850.790000      0.000000   
        1983    638.820000  2874.680000   37272.955000   1916.455000   
        1984   5419.645000   516.155000   12778.280000     96.990000   
        1985     99.836667     0.000000    2819.083333  13977.286667   
        1986   7113.757500  3909.402500   10313.850000   2823.675000   
        1987      0.000000   708.610000    6841.430000      0.000000   
        1988      0.000000     0.000000    4711.750000      0.000000   
        1997    408.410000   204.210000       0.000000      0.000000   
        1998    251.690000     0.000000   13507.510000      0.000000   
        2001    208.040000   485.420000   17474.970000      0.000000   
        2002      0.000000     0.000000    2173.193333      0.000000   
        2004      0.000000     0.000000       0.000000    465.480000   
        2005      0.000000     0.000000     231.600000      0.000000   
        2006      0.000000     0.000000    1209.420000      0.000000   
        2007      0.000000     0.000000       0.000000      0.000000   
        2009      0.000000     0.000000    3561.485000      0.000000   
        2010      0.000000     0.000000     347.500000      0.000000   
        2011      0.000000     0.000000    4742.660000      0.000000   
    -74 1994    908.000000   113.500000    1589.000000      0.000000   
        1996    231.960000     0.000000   55207.410000      0.000000   
        1999  18414.830000   575.460000  142139.500000    575.460000   
        2000      0.000000     0.000000   19388.510000    842.980000   
        2003   4654.030000     0.000000    2694.440000      0.000000   
        2005   6229.700000  1245.940000    8098.600000      0.000000   
        2006   3294.950000  6243.060000   14740.550000    346.840000   
        2013  19512.960000  9420.050000   10261.130000      0.000000   
36  -75 1977    426.498667   612.972667    5114.548667   6870.683333   
        1978    213.488571  1339.422857    7686.621429  12274.601429   
        1979    140.881333   536.751333    5403.046000   7349.630667   
        1980   3669.234667   552.227333   68924.586000   6117.865333   
        1981     24.875455   322.680909    6340.275455  22119.756364   
        1982   9261.615714  3890.987143   16226.872857    122.911429   
        1983    199.143333   478.403333   70658.200476    524.357143   
        1984   8193.242857   996.810714   43920.052143   3900.735714   
        1985    514.931364    23.260455    8051.927727    924.295909   
        1986   2884.836667  4156.231667   23677.898000   1393.353667   
        1987   2555.915417  2792.415000   25936.742917    833.030833   
        1988   1410.285000  1159.560000    7141.306667      0.000000   
        1993   6952.696667  9823.026667   64672.390000   5508.930000   
        1994    214.590000     0.000000  127691.851429   4720.532857   
        1995      0.000000     0.000000    3534.124000      0.000000   
        1996      0.000000     0.000000   63118.047500      0.000000   
        1997    175.766000   421.272000   23034.918000    527.300000   
        1998    105.652000     0.000000    2508.806000      0.000000   
        1999      0.000000     0.000000    1245.455000      0.000000   
        2000      0.000000     0.000000   10889.758000      0.000000   
        2001    948.245000   441.873750   27449.827500    138.930000   
  • What about xs? http://stackoverflow.com/a/18835174/3826323 – Daniele Nov 06 '14 at 00:12
  • I had used xs to create the matrix and just edited the question to show my code, but I am still unsure how to get at just the data I need in one line as a slice. Like I said, I can find the length, and I can find the values, but I can't find the values for a certain length. I may have to use another loop to do this, but I was hoping for something simpler. Thanks for your comment. – Ryan Nov 06 '14 at 01:58

2 Answers


I'm not sure I understand your question correctly. Do you want all the [lat, lon] combinations that have more than 30 year entries?

If so, you could probably use something like

many_years = df[df.groupby(level=[0, 1])['101'].transform('count') >= 30]

(Group on levels 0 and 1 only - lat and lon - since including yr would give each group a count of 1. transform('count') broadcasts each group's count back to every row, so the boolean mask aligns with df's index.)

I can try to help better if you can provide a minimal example of your Data Frame, ideally that we can import via pd.read_clipboard() in ipython, and the result you want.
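For example, on a toy frame with the same index structure (the threshold and values are illustrative):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(35, -75, 1978), (35, -75, 1979), (36, -75, 1977)],
    names=['lat', 'lon', 'yr'])
df = pd.DataFrame({'101': [410.63, 0.0, 426.50]}, index=idx)

# Keep only (lat, lon) groups with at least 2 non-null years in column '101'
many_years = df[df.groupby(level=['lat', 'lon'])['101'].transform('count') >= 2]
```

Only the two (35, -75) rows survive, since (36, -75) has a single year.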

  • I added some more data in my edit above. Yes, that is what I am after. I have a `.count` output from my groupby function, but it yields the number of samples that were included in that lat,lon bin which were averaged together for that year. I am binning samples to 1 degree bins and taking the means for each year. I can only use mostly full records (>32 years per location). – Ryan Nov 06 '14 at 14:55
  • OK, after some digging, xs MIGHT work IF I can pass a list to the row selector. Something like `df.xs(({lat_list},), level='lat', ({lon_list},), level='lon')[:]` – Ryan Nov 06 '14 at 21:12
  • Or not... I need combinations of lat and lon together, so passing a list of lats and lons separately to be used as keys will not remove anything... – Ryan Nov 06 '14 at 21:18

In the interest of SO, here is the solution I ended up with. No, I did not find an elegant one-line solution, and I had to use a loop, but it gets the job done.

# remove lat/lon combinations where total sample years are fewer than Num
Num = 32  # set criterion: 2014 - 1977 = 37 years, minus 5 allowed gaps -> 32

def remove_N(nm, cv, cv2):
    """Drop lat/lon combinations with fewer than Num years of data."""
    nm['id'] = -1
    ll_index = 0
    keep = list()
    for lat in cv:
        for lon in cv2:
            n_years = len(nm.xs((lat, lon), level=('lat', 'lon')))
            if n_years == 0:
                continue
            if n_years >= Num:
                keep.append(ll_index)
            # label every row of this (lat, lon) group, then advance the counter
            # (assigning via nm.loc avoids setting values on a copy)
            nm.loc[(lat, lon), 'id'] = ll_index
            ll_index = ll_index + 1
    data = nm[nm['id'].isin(keep)]  # .ix is deprecated; boolean indexing works
    data = data.drop('id', axis=1)
    return data
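(For the record, pandas' GroupBy.filter can express the same selection without the loop or the helper id column. A sketch on a toy frame, assuming the same three-level lat/lon/yr index; Num here is the same threshold as above:)

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(40, -70, 1977), (40, -70, 1978), (40, -70, 1979), (35, -75, 1978)],
    names=['lat', 'lon', 'yr'])
nm = pd.DataFrame({'101': [1.0, 2.0, 3.0, 4.0]}, index=idx)

Num = 3
# Keep whole (lat, lon) groups whose number of year rows is at least Num
data = nm.groupby(level=['lat', 'lon']).filter(lambda g: len(g) >= Num)
```

filter returns the original rows (all columns) of every group for which the callable is True, which is exactly the "keep mostly full records" selection.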