Why is .loc only returning a single row where multiple rows have the same MultiIndex?

Given the following dataframe

           col0      col1  col2
idx0 idx1
0    0      1.0  example1   1.0
     0      4.0  example2   8.0
     1      9.0  example3  27.0
     1     16.0  example4  64.0
1    0      0.5  example1   0.5
     0      2.0  example2   4.0
     1      4.5  example3  13.5
     1      8.0  example4  32.0

the .xs operation will select

In [121]: df.xs((0,1), level=[0,1])
Out[121]:
           col0      col1  col2
idx0 idx1
0    1      9.0  example3  27.0
     1     16.0  example4  64.0

whilst the .loc operation will select

In [125]: df.loc[[(0,1)]]
Out[125]:
           col0      col1  col2
idx0 idx1
0    1     16.0  example4  64.0

This is highlighted even further by the following

In [149]: df.loc[pd.IndexSlice[:, 1], :]
Out[149]:
           col0      col1  col2
idx0 idx1
0    1      9.0  example3  27.0
     1     16.0  example4  64.0

In [150]: df.loc[pd.IndexSlice[0, 1], :]
Out[150]:
col0          16
col1    example4
col2          64
Name: (0, 1), dtype: object

Set Up

import pandas as pd
import numpy as np
idx0 = range(2)
idx1 = np.repeat(range(2), 2)

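# Built directly from levels and labels, as written for pandas 0.20.3
# (the `labels` argument was renamed `codes` in pandas 0.24).
midx = pd.MultiIndex(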
    levels=[idx0, idx1],
    labels=[
        np.repeat(range(len(idx0)), len(idx1)),
        np.tile(range(len(idx1)), len(idx0))
    ],
    names=['idx0', 'idx1']
)

df = pd.DataFrame(
    [
        [i**2/float(j), 'example{}'.format(i), i**3/float(j)]
        for j in range(1, len(idx0) + 1)
        for i in range(1, len(idx1) + 1)
    ],
    columns=['col0', 'col1', 'col2'],
    index=midx
)
Alexander McFarlane
  • This is especially unusual given that, [with a basic Index, `loc` will return all instances of the label if you have duplicates.](https://stackoverflow.com/a/45636490/7954504) – Brad Solomon Sep 07 '17 at 14:02
  • FYI: If you initiate the index with `np.array` as `dtype=int` it still has problems, so it is not a floating-point issue – Alexander McFarlane Sep 07 '17 at 14:17
  • Which version of pandas are you using? – Alexander Sep 07 '17 at 14:34
  • The latest: https://github.com/pandas-dev/pandas/releases/tag/v0.20.3 (upvoted you for a cool name) – Alexander McFarlane Sep 07 '17 at 14:42
  • I think this might fall under the category of a bug, I've submitted a github issue at: https://github.com/pandas-dev/pandas/issues/17464 – Alexander McFarlane Sep 07 '17 at 14:44
  • CONCLUSION (per link above): you're constructing the MultiIndex directly, which violates its guarantees, namely that the values within each level are unique. We don't explicitly check this, as the public constructors guarantee it. To clarify, you certainly can have a non-unique MultiIndex (though this is generally discouraged as they are not that performant), but you would have duplicate labels, never duplicate level values. – Alexander Sep 08 '17 at 14:48

2 Answers

Using .xs

df.xs((0,1), level=[0,1])
Out[74]: 
           col0      col1  col2
idx0 idx1                      
0    1      9.0  example3  27.0
     1     16.0  example4  64.0

Using .loc

df.loc[0].loc[1]
Out[75]: 
      col0      col1  col2
idx1                      
1      9.0  example3  27.0
1     16.0  example4  64.0

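Wrap the second-level label in a list (see the slicers documentation: http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers):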

df.loc[(0, [1]),:]

Out[90]: 
           col0      col1  col2
idx0 idx1                      
0    1      9.0  example3  27.0
     1     16.0  example4  64.0
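
For reference, the three selections above can be compared side by side. This is only a minimal sketch: it assumes the index is rebuilt with `pd.MultiIndex.from_product` (so every level holds unique values) instead of the constructor call from the question, and the `rows` helper is a hypothetical name used just for the comparison.

```python
import pandas as pd

# Assumption: index built with from_product, so each level is unique
# even though the (idx0, idx1) pairs themselves repeat.
midx = pd.MultiIndex.from_product([[0, 1], [0, 0, 1, 1]], names=['idx0', 'idx1'])
df = pd.DataFrame(
    [
        [i**2 / float(j), 'example{}'.format(i), i**3 / float(j)]
        for j in range(1, 3)
        for i in range(1, 5)
    ],
    columns=['col0', 'col1', 'col2'],
    index=midx,
)


def rows(frame):
    """Hypothetical helper: the col1 values of a selection as a plain list."""
    return frame['col1'].tolist()


via_xs = df.xs((0, 1), level=[0, 1])   # cross-section on both levels
via_chained_loc = df.loc[0].loc[1]     # select idx0 == 0, then idx1 == 1
via_list_loc = df.loc[(0, [1]), :]     # slicer form with a list for idx1
```

On such an index all three selections should pick out the same two rows; they differ only in which index levels the result keeps.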
BENY
  • This doesn't really answer the question ... I have a workaround; I'm asking why `.loc[[(0,1)]]` doesn't work. Thank you for the further example, though it seems to just emphasise the unexpected behavior of `.loc[[(0,1)]]`! – Alexander McFarlane Sep 07 '17 at 13:56
  • Can you explain why the `[1]` is necessary? It seems quite arbitrary! – Alexander McFarlane Sep 07 '17 at 14:09
  • @AlexanderMcFarlane Check the link http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers – BENY Sep 07 '17 at 14:16
  • The docs still don't show an example of why `[1]` should behave differently to `1`. Fundamentally, when doing `df.loc[pd.IndexSlice[:, 1], :]` I get all the duplicates but `df.loc[pd.IndexSlice[0, 1], :]` will only return one row. This is actually a really good example so I'll add to OP – Alexander McFarlane Sep 07 '17 at 14:23

I don't believe your multi-index is created correctly.

df = df.assign(
    idx0=[0] * 4 + [1] * 4, 
    idx1=[0, 0, 1, 1] * 2).set_index(['idx0', 'idx1'])

Using one of the correct ways to use loc for accessing the data:

>>> df.loc[(0, 1), :]
           col0      col1  col2
idx0 idx1                      
0    1        9  example3    27
     1       16  example4    64

Using the same command on the original dataframe, I get: TypeError: only integer arrays with one element can be converted to an index.

UPDATE

As I mentioned before, you do not appear to be creating your multi-index correctly. This dataframe with the properly constructed multi-index works as expected with your examples (using an older pandas, v 0.17.2).

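# from_product factorizes each input, so the repeated [0, 0, 1, 1]
# collapses to the unique level [0, 1].
# idx0 and idx1 used below are the objects from the question's Set Up.
midx = pd.MultiIndex.from_product([[0, 1], [0, 0, 1, 1]], names=['idx0', 'idx1'])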
df = pd.DataFrame(
    [
        [i**2/float(j), 'example{}'.format(i), i**3/float(j)]
        for j in range(1, len(idx0) + 1)
        for i in range(1, len(idx1) + 1)
    ],
    columns=['col0', 'col1', 'col2'],
    index=midx)

Using midx as defined above:

>>> midx
MultiIndex(levels=[[0, 1], [0, 1]],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1]],
           names=[u'idx0', u'idx1'])

Using midx per your definition:

>>> midx
MultiIndex(levels=[[0, 1], [0, 0, 1, 1]],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]],
           names=[u'idx0', u'idx1'])
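
To make the difference concrete, the sketch below checks each level for duplicates and then selects `(0, 1)` from a frame that uses the properly built index. It is an illustration under stated assumptions, not a definitive diagnosis: it uses the `codes` keyword (the later name for `labels`), the `levels_are_unique` helper is a hypothetical name, and the expectation that newer pandas releases reject duplicate level values in the constructor with a `ValueError` is my understanding of current behaviour, so check it against your version.

```python
import pandas as pd


def levels_are_unique(index):
    """Hypothetical helper: True if no level of a MultiIndex repeats a value."""
    return all(level.is_unique for level in index.levels)


good = pd.MultiIndex.from_product([[0, 1], [0, 0, 1, 1]], names=['idx0', 'idx1'])

# The index as a whole contains duplicate entries, yet each level is unique.
good_is_unique = good.is_unique
good_levels_ok = levels_are_unique(good)

# Assumption: recent pandas validates level values in the constructor,
# so the malformed index from the question is rejected up front.
try:
    pd.MultiIndex(
        levels=[[0, 1], [0, 0, 1, 1]],
        codes=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]],
        names=['idx0', 'idx1'],
    )
    constructor_rejected = False
except ValueError:
    constructor_rejected = True

# A small frame on the well-formed index; only col1 is needed here.
df = pd.DataFrame(
    {'col1': ['example{}'.format(i) for i in [1, 2, 3, 4] * 2]},
    index=good,
)
selected = df.loc[(0, 1), :]  # both duplicate rows for (0, 1)
```

Checking `index.levels` like this is a quick way to tell a merely non-unique MultiIndex (duplicate entries, unique levels) from a malformed one (duplicate values inside a level).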
Alexander
  • The `df` in the question has two `int64` indices, seems like it is created "properly". Can you explain why this behaves as expected with your modification? – Brad Solomon Sep 07 '17 at 14:12
  • yes I came across the same. I'll make my `MultiIndex` creation cleaner in the example at the bottom. I compacted it to make my post shorter but I have the same confusion as @BradSolomon – Alexander McFarlane Sep 07 '17 at 14:12
  • Thanks for the update. Is this issue a case of there being duplicate values in the `levels` kwarg? It seems that perhaps the Pandas lib should throw an error if `levels` can't take duplicates? – Alexander McFarlane Sep 07 '17 at 16:05
  • Although ill-advised, a dataframe can have identical column names. So I believe `levels` should be able to accept duplicates for consistency. – Alexander Sep 07 '17 at 16:31