392

In R when you need to retrieve a column index based on the name of the column you could do

idx <- which(names(my_data)==my_colum_name)

Is there a way to do the same with pandas dataframes?

smci
  • 32,567
  • 20
  • 113
  • 146
ak3nat0n
  • 6,060
  • 6
  • 36
  • 59

10 Answers10

631

Sure, you can use .get_loc():

In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})

In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)

In [47]: df.columns.get_loc("pear")
Out[47]: 2

although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.

DSM
  • 342,061
  • 65
  • 592
  • 494
  • 16
    Column number is useful when using `.iloc` operator, where you must pass only integers for both rows and columns. – abe Sep 08 '17 at 18:52
  • 11
    Or when using libraries which want the DF converted to a numpy array and indices of columns with particular features. For example CatBoost wants a list of indices of categorical features. – Tom Walker Oct 13 '17 at 05:35
  • 4
    Is there a way to get list of indexes? – haneulkim Sep 03 '21 at 02:55
  • In my case I want to use the index of the column get values of "itertuples" by column name. Fetching the indeces of the column names instead of hardcoding keeps it dynamic in case of changes to the DF. – cyclux Mar 10 '22 at 11:00
  • 1
    it's also useful for the insert function (to enable you to insert after a given column) – seanv507 Nov 25 '22 at 16:43
  • This works good for superficially figuring out the column number in an excel file based on a df column header, after writing that df with xlsxwriter. – MrChadMWood Aug 21 '23 at 20:32
74

Here is a solution through list comprehension. cols is the list of columns to get index for:

[df.columns.get_loc(c) for c in cols if c in df]
cs95
  • 379,657
  • 97
  • 704
  • 746
snovik
  • 1,027
  • 10
  • 14
20

For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:

df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)

If you have non-unique labels in the index (columns only support unique labels) get_indexer_for. It takes the same args as get_indexer:

df = pd.DataFrame(
    {"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]}, 
    index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)

Both methods also support non-exact indexing with, f.i. for float values taking the nearest value with a tolerance. If two indices have the same distance to the specified label or are duplicates, the index with the larger index value is selected:

df = pd.DataFrame(
    {"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
    index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)
cottontail
  • 10,268
  • 18
  • 50
  • 51
JE_Muc
  • 5,403
  • 2
  • 26
  • 41
18

DSM's solution works, but if you wanted a direct equivalent to which you could do (df.columns == name).nonzero()

Wes McKinney
  • 101,437
  • 32
  • 142
  • 108
13

When you might be looking to find multiple column matches, a vectorized solution using searchsorted method could be used. Thus, with df as the dataframe and query_cols as the column names to be searched for, an implementation would be -

def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]

Sample run -

In [162]: df
Out[162]: 
   apple  banana  pear  orange  peach
0      8       3     4       4      2
1      4       4     3       0      1
2      1       2     6       8      1

In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])
Divakar
  • 218,885
  • 19
  • 262
  • 358
13

Update: "Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead." pandas docs

In case you want the column name from the column location (the other way around to the OP question), you can use:

>>> df.columns.values()[location]

Using @DSM Example:

>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})

>>> df.columns

Index(['apple', 'orange', 'pear'], dtype='object')

>>> df.columns.values()[1]

'orange'

Other ways:

df.iloc[:,1].name

df.columns[location] #(thanks to @roobie-nuby for pointing that out in comments.) 
mallet
  • 2,454
  • 3
  • 37
  • 64
9

To modify DSM's answer a bit, get_loc has some weird properties depending on the type of index in the current version of Pandas (1.1.5) so depending on your Index type you might get back an index, a mask, or a slice. This is somewhat frustrating for me because I don't want to modify the entire columns just to extract one variable's index. Much simpler is to avoid the function altogether:

list(df.columns).index('pear')

Very straightforward and probably fairly quick.

JoeTheShmoe
  • 433
  • 6
  • 13
5

When the column might or might not exist, then the following (variant from above works.

ix = 'none'
try:
     ix = list(df.columns).index('Col_X')
except ValueError as e:
     ix = None  
     pass

if ix is None:
   # do something
QuentinJS
  • 162
  • 1
  • 9
4

how about this:

df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]
Siraj S.
  • 3,481
  • 3
  • 34
  • 48
0
import random
def char_range(c1, c2):                      # question 7001144
    for c in range(ord(c1), ord(c2)+1):
        yield chr(c)      
df = pd.DataFrame()
for c in char_range('a', 'z'):               
    df[f'{c}'] = random.sample(range(10), 3) # Random Data
rearranged = random.sample(range(26), 26)    # Random Order
df = df.iloc[:, rearranged]
print(df.iloc[:,:15])                        # 15 Col View         

for col in df.columns:             # List of indices and columns
    print(str(df.columns.get_loc(col)) + '\t' + col)

![Results](Results