This question is part 2 of my previous one.

For example, I have a DataFrame like this:

import pandas as pd

# each cell holds a list of three integers (range replaces Python 2's xrange)
df = pd.DataFrame({
    'A': [[e for e in range(x+1, x+4)] for x in range(0, 15, 3)],
    'B': [[e*10 for e in range(x+1, x+4)] for x in range(0, 15, 3)],
    'C': [[e*100 for e in range(x+1, x+4)] for x in range(0, 15, 3)]
})

              A                B                   C
0     [1, 2, 3]     [10, 20, 30]     [100, 200, 300]
1     [4, 5, 6]     [40, 50, 60]     [400, 500, 600]
2     [7, 8, 9]     [70, 80, 90]     [700, 800, 900]
3  [10, 11, 12]  [100, 110, 120]  [1000, 1100, 1200]
4  [13, 14, 15]  [130, 140, 150]  [1300, 1400, 1500]

I need to get the rows where 'A' contains 10.
Right now I'm using:

# True for each row whose list in 'A' contains 10
f = lambda x: 10 in x
mask = df['A'].apply(f)
df[mask]

My questions are:

  • Is this an OK way to select rows by membership testing? Is there a better one?
  • Is it OK to put lists (and sets) in DataFrame cells at all?
Gill Bates

1 Answer


You are much better off constructing a frame with multi-indexed (hierarchical) columns. This is much faster because the values are then stored as native types rather than as Python objects (hint: run df.dtypes on your frame; every column will be object).
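The dtype hint can be checked with a minimal sketch. The two small frames below (`nested` and `flat`) are hypothetical and exist only to compare dtypes; they are not the frames from the question.

```python
import pandas as pd

# Hypothetical small frames, built only to compare dtypes.
nested = pd.DataFrame({'A': [[1, 2, 3], [4, 5, 6]]})  # one list per cell
flat = pd.DataFrame([[1, 2, 3], [4, 5, 6]])           # one integer per cell

nested_dtypes = nested.dtypes  # 'A' is stored as generic Python objects
flat_dtypes = flat.dtypes      # every column is a native integer dtype

print(nested_dtypes)
print(flat_dtypes)
```

Columns of object dtype hold references to Python objects, so operations on them generally fall back to per-element Python calls instead of vectorized ones.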

In [2]: import pandas as pd

In [3]: A = pd.DataFrame([[e for e in range(x+1, x+4)] for x in range(0, 15, 3)])

In [4]: B = pd.DataFrame([[e*10 for e in range(x+1, x+4)] for x in range(0, 15, 3)])

In [5]: C = pd.DataFrame([[e*100 for e in range(x+1, x+4)] for x in range(0, 15, 3)])

# this creates a 2-level column hierarchy
In [8]: df = pd.concat([A, B, C], keys=['A', 'B', 'C'], axis=1)

In [9]: df
Out[9]: 
    A            B               C            
    0   1   2    0    1    2     0     1     2
0   1   2   3   10   20   30   100   200   300
1   4   5   6   40   50   60   400   500   600
2   7   8   9   70   80   90   700   800   900
3  10  11  12  100  110  120  1000  1100  1200
4  13  14  15  130  140  150  1300  1400  1500

# select out A
In [14]: df['A']
Out[14]: 
    0   1   2
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
4  13  14  15

# this is a boolean frame (True where the value is greater than 10)
In [11]: df['A'] > 10
Out[11]: 
       0      1      2
0  False  False  False
1  False  False  False
2  False  False  False
3  False   True   True
4   True   True   True
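To come back to the original question (rows where 'A' contains 10), a boolean frame like the one above can be reduced to one value per row. This is a minimal sketch, assuming Python 3 (`range` instead of `xrange`) and rebuilding B and C as multiples of A, which yields the same values as the frame above; the names `mask` and `result` are only illustrative.

```python
import pandas as pd

# Rebuild the multi-indexed frame from the answer, using Python 3's range.
A = pd.DataFrame([list(range(x + 1, x + 4)) for x in range(0, 15, 3)])
df = pd.concat([A, A * 10, A * 100], keys=['A', 'B', 'C'], axis=1)

# True for each row in which any of the 'A' sub-columns equals 10.
mask = (df['A'] == 10).any(axis=1)

# Boolean row selection; only the row with index 3 is kept.
result = df[mask]
print(result)
```

The comparison and the `any` reduction are both vectorized, so this avoids calling a Python function once per row as `apply` does.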

# selecting specific slices (.ix was removed in pandas 1.0; .loc does the same here)

In [26]: df.loc[:, ('A', 1)]
Out[26]: 
0     2
1     5
2     8
3    11
4    14
Name: (A, 1), dtype: int64
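A related selection takes the same inner position from every top-level group at once. The sketch below is one possible way to do it with `DataFrame.xs`, assuming Python 3 and the same values as the frame above; the name `second` is only illustrative.

```python
import pandas as pd

# Rebuild the multi-indexed frame from the answer, using Python 3's range.
A = pd.DataFrame([list(range(x + 1, x + 4)) for x in range(0, 15, 3)])
df = pd.concat([A, A * 10, A * 100], keys=['A', 'B', 'C'], axis=1)

# Take inner column 1 from each of the groups A, B and C.
# The selected inner level is dropped, leaving columns 'A', 'B', 'C'.
second = df.xs(1, axis=1, level=1)
print(second)
```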
Jeff