3

I need to get the column names of a pandas DataFrame where the columns match those in a numpy array.

Example

import numpy as np
import pandas as pd

x = pd.DataFrame( data=[[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]], columns=list('abc') )

y = np.array( x[['b','c']] )
y

y then holds the second and third columns of the DataFrame:

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

How can I get the names of the columns in x that match the columns of y? (In this case, b and c.)

I am looking for something like:

x[ x==y ].columns

or

pd.DataFrame(y).isin(x)

The example is motivated by a feature selection problem, and was taken from the sklearn page.


I am using numpy 1.11.1 and pandas 0.18.1.

Luis

2 Answers

5

Here's an approach with NumPy broadcasting -

x.columns[(x.values[...,None] == y[:,None]).all(0).any(1)]
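
To see why this works, the one-liner can be unpacked into its intermediate arrays. The sketch below reuses the `x` and `y` from the question; the names `x3`, `y3`, `eq`, `col_match`, `mask` and `matched` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(
    data=[[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]],
    columns=list('abc'),
)
y = np.array(x[['b', 'c']])

# Add a trailing singleton axis to x and a middle singleton axis to y
x3 = x.values[..., None]   # shape (6, 3, 1), same as x.values[:, :, None]
y3 = y[:, None]            # shape (6, 1, 2), same as y[:, None, :]

# Broadcasting compares every column of x with every column of y, row by row
eq = x3 == y3              # shape (6, 3, 2)

# A pair of columns matches only if all rows agree
col_match = eq.all(0)      # shape (3, 2): x column i vs. y column j

# Keep an x column if it matches at least one y column
mask = col_match.any(1)    # shape (3,)

matched = list(x.columns[mask])
print(matched)             # ['b', 'c']
```

Note that `eq` holds one boolean per (row, x column, y column) combination, so the intermediate array grows quickly when both inputs are wide.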
Divakar
  • That's cool! Could you please explain what `x.values[...,None]` is? – MaxU - stand with Ukraine Nov 14 '16 at 19:35
  • @MaxU Well, we are introducing a singleton dim at the end of that extracted array. Essentially it's `x.values[:,:,None]`. With that `ellipsis`, we are just replacing `:,:`. From this [post](http://stackoverflow.com/a/773472/3293881): `"Ellipsis is used here to indicate a placeholder for the rest of the array dimensions not specified."` – Divakar Nov 14 '16 at 19:38
  • I love this answer. I started to do this too! But that's just because of the things you've taught me ;-) – piRSquared Nov 14 '16 at 19:56
  • @piRSquared I didn't realize I taught you broadcasting! :) – Divakar Nov 14 '16 at 19:57
  • @Divakar when first exposed to broadcasting, it makes sense (to me). But it is still tricky to keep track of what is going where and when. Looking at your examples (over and over) has taught me much and has allowed me to traverse a learning curve that would've been steeper without you. – piRSquared Nov 14 '16 at 20:00
  • @piRSquared Ah that's nice to hear! Keep an eye on the memory usage though. Nevertheless, would love to see more of broadcasting in `pandas`! – Divakar Nov 14 '16 at 20:04
  • @Divakar, piRSquared it's still tricky for me too, but it's extremely fast and useful. Thank you Divakar! – MaxU - stand with Ukraine Nov 14 '16 at 20:09
  • Thank you! Name an answer that answers the question and teaches something new too! I guess broadcasting's the next thing to grasp in my list now ;) – Luis Nov 16 '16 at 10:28
1

Maybe this?

import numpy as np
import pandas as pd

x = pd.DataFrame( data=[[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]], columns=list('abc') )

y = np.array( x[['b','c']] )

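# Compare every column of y against every column of x
for yj in y.T:                    # y.T iterates over the columns of y
    for xj in x:                  # iterating a DataFrame yields its column names
        if (x[xj] == yj).all():   # every row is equal, so the column matches
            print(xj)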
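
If the matches are needed as a value rather than printed, the same loop idea can collect them into a list. This is a minimal sketch, assuming exact equality is the desired test; `matching_columns` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd


def matching_columns(df, arr):
    """Return the names of df columns equal to at least one column of arr."""
    names = []
    for name in df.columns:
        col = df[name].values
        # arr.T iterates over the columns of arr
        if any(np.array_equal(col, other) for other in arr.T):
            names.append(name)
    return names


x = pd.DataFrame(
    data=[[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]],
    columns=list('abc'),
)
y = np.array(x[['b', 'c']])

print(matching_columns(x, y))  # ['b', 'c']
```

Unlike the printing loop, this version returns names in DataFrame column order and lists each name once, even if several columns of `y` match it. It compares one pair of columns at a time, so it never builds a three-dimensional intermediate array.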
John Smith