Both of these forms pop up in our codebase:

pandas.DataFrame.columns.values.tolist()
pandas.DataFrame.columns.tolist()

Are these always identical? I'm not sure why the values variant shows up where it does; the direct columns.tolist() seems to be all that's needed to get the column names. If that's the case, I'd like to clean up the code a bit.

A bit of introspection suggests that values is just an implementation detail that exposes the underlying numpy.ndarray:

>>> import pandas
>>> d = pandas.DataFrame( { 'a' : [1,2,3], 'b' : [0,1,3]} )
>>> d
   a  b
0  1  0
1  2  1
2  3  3
>>> type(d.columns)
<class 'pandas.core.indexes.base.Index'>
>>> type(d.columns.values)
<class 'numpy.ndarray'>
>>> type(d.columns.tolist())
<class 'list'>
>>> type(d.columns.values.tolist())
<class 'list'>
>>> d.columns.values
array(['a', 'b'], dtype=object)
>>> d.columns.values.tolist()
['a', 'b']
>>> d.columns
Index(['a', 'b'], dtype='object')
>>> d.columns.tolist()
['a', 'b']
jxramos

2 Answers

The output is the same, but for a really big DataFrame the timings differ:

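import numpy as np
import pandas as pd

np.random.seed(23)
# 5 rows x 10000 columns, with the column labels converted to strings
df = pd.DataFrame(np.random.randint(3, size=(5, 10000)))
df.columns = df.columns.astype(str)
print(df)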

In [90]: %timeit df.columns.values.tolist()
10000 loops, best of 3: 79.5 µs per loop

In [91]: %timeit df.columns.tolist()
10000 loops, best of 3: 173 µs per loop

The two variants also go through different functions:

Index.values followed by numpy.ndarray.tolist

Index.tolist
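
Because the two variants take different code paths, whether they return identical lists can depend on the dtype of the column labels. The following is a minimal sketch, not part of the original answer: it assumes a reasonably recent pandas, and the frame names `plain` and `dated` are made up for illustration. For plain string labels the two lists match; for datetime-like labels the element types can differ, since `Index.tolist` returns pandas scalars while `numpy.ndarray.tolist` converts the raw NumPy values.

```python
import pandas as pd

# String labels: both paths give the same list of Python strings.
plain = pd.DataFrame({"a": [1, 2, 3], "b": [0, 1, 3]})
via_index = plain.columns.tolist()
via_values = plain.columns.values.tolist()
print(via_index == via_values)

# Datetime labels (hypothetical example): compare the element types
# produced by each path instead of assuming they agree.
dated = pd.DataFrame(
    [[1, 2]],
    columns=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)
index_elem = dated.columns.tolist()[0]
values_elem = dated.columns.values.tolist()[0]
print(type(index_elem), type(values_elem))
```

If the codebase only ever uses string column names, this difference should not matter, but it is worth checking before replacing one form with the other everywhere.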

Thanks to Mitch for another solution:

In [93]: %timeit list(df.columns.values)
1000 loops, best of 3: 169 µs per loop
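
The `%timeit` lines above only work inside IPython. As a rough sketch of how the same comparison could be reproduced in a plain script, the standard-library `timeit` module can be used instead. The `candidates` dictionary and the `repeats` value are arbitrary choices for illustration, and absolute numbers will vary by machine and by pandas/NumPy version, so no particular ordering is assumed here.

```python
import timeit

import numpy as np
import pandas as pd

# Wide frame with string column labels, mirroring the setup above.
df = pd.DataFrame(np.random.randint(3, size=(5, 10000)))
df.columns = df.columns.astype(str)

# The three variants discussed in this answer.
candidates = {
    "df.columns.values.tolist()": lambda: df.columns.values.tolist(),
    "df.columns.tolist()": lambda: df.columns.tolist(),
    "list(df.columns.values)": lambda: list(df.columns.values),
}

repeats = 200  # arbitrary; raise it for steadier numbers
timings = {}
for label, func in candidates.items():
    total = timeit.timeit(func, number=repeats)
    timings[label] = total / repeats * 1e6  # mean microseconds per call
    print(f"{label:30s} {timings[label]:8.1f} us per call")
```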
jezrael
  • Very good, I found this [answer](https://stackoverflow.com/a/29494537/1330381) to a question about how to get the column headers as a list and it looks like there was an API simplification at some point. – jxramos Jul 17 '17 at 20:01
  • `tolist` for an Index object just calls `list(self.values)` from what I can see, so the performance difference seen here is just `df.columns.values.tolist()` vs `list(df.columns.values)`. – miradulo Jul 17 '17 at 20:03
  • And I'm not advocating for that solution necessarily, I just wanted to point out that `list(df.columns.values)` is wholly equivalent to `df.columns.tolist()`. So the performance difference here is the `list` built-in function versus the NumPy array `.tolist()`, the latter of which seems faster. – miradulo Jul 17 '17 at 20:08
d = pandas.DataFrame( { 'a' : [1,2,3], 'b' : [0,1,3]} )

Or you can simply do:

list(d)  # same as d.columns.tolist()
Out[327]: ['a', 'b']

# Timing
%timeit list(df)  # on my machine this was the slowest option
10000 loops, best of 3: 135 µs per loop
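
The reason `list(d)` works is that iterating over a DataFrame yields its column labels rather than its rows. A small sketch of that behaviour, reusing the `d` frame from above (the variable `labels_from_iter` is just an illustrative name):

```python
import pandas as pd

d = pd.DataFrame({"a": [1, 2, 3], "b": [0, 1, 3]})

# Iterating over a DataFrame yields its column labels, not its rows.
labels_from_iter = [label for label in d]

print(labels_from_iter)               # ['a', 'b']
print(list(d) == d.columns.tolist())  # True
```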
BENY