I have a Pandas dataframe in Python (3.6) with numeric and categorical attributes. I want to pull a list of numeric columns for use in other parts of my code. My question is what is the most efficient way of doing this?

This seems to be the standard answer:

num_cols = df.select_dtypes([np.number]).columns.tolist()

But I'm worried that select_dtypes() can be slow, and it seems to add an intermediate step that I'm hoping isn't necessary (subsetting the data before pulling back the column names of just the numeric attributes).

Any ideas on a more efficient way of doing this? (I know there is a private method _get_numeric_data() that could also be used, but I wasn't able to find out how it works, and I don't love relying on a private method as a long-term solution.)

user1895076

2 Answers

df.select_dtypes is for selecting data. It makes a copy of your data, which you then essentially discard by keeping only the column labels. That is inefficient if all you want is the names. Just use something like:

import numpy as np
df.columns[[np.issubdtype(dt, np.number) for dt in df.dtypes]]
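
A minimal sketch of this approach on a small made-up frame, converted to a plain list. The column names and values are hypothetical. The string column is created with an explicit object dtype because np.issubdtype expects NumPy dtypes; pandas extension dtypes (for example category) may need separate handling, which is worth checking on your own version.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two numeric columns, one string column, one bool column
df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "qty": [1, 2, 3],
    "label": pd.Series(["a", "b", "c"], dtype=object),
    "flag": [True, False, True],
})

# Boolean mask built from the dtypes alone, without subsetting the frame
mask = [np.issubdtype(dt, np.number) for dt in df.dtypes]

# Index the column labels with the mask and convert to a plain list
num_cols = df.columns[mask].tolist()
print(num_cols)  # ['price', 'qty']
```

Note that the bool column is left out, because np.bool_ is not a subtype of np.number.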
juanpa.arrivillaga
  • I wrapped this in list() to return it as a list, but essentially this is exactly what I was looking for, thanks! And thanks for confirming that select_dtypes is inefficient. It didn't make sense to me that that would be the best way to get JUST the column names, which is what I was after. – user1895076 Sep 26 '17 at 17:37
  • actually it doesn't generally copy; this is only a view; it is exactly like subsetting the columns themselves – Jeff Sep 26 '17 at 21:35
  • @Jeff it was my impression that it is generally hard to predict whether this returns a view or a copy, and it depends on the memory layout of the underlying array. I suspect, though, if you have mixed numeric dtypes, it almost certainly will return a copy rather than a view. – juanpa.arrivillaga Sep 26 '17 at 21:41
  • no this will almost always return a view; data is already stored dtype segregated; the memory layout is not relevant here; when you subset memory layout comes into play – Jeff Sep 26 '17 at 21:45
  • @Jeff um, I just tested this out with [the example from the docs](https://gist.github.com/juanarrivillaga/3413004317b2826842412d69e88c3e29), and it returns a copy, not a view, or is there something I am fundamentally misunderstanding? – juanpa.arrivillaga Sep 26 '17 at 21:54
  • it shouldn't ; could be a bug in that it does actually invoke copy (it's not a view directly) – Jeff Sep 26 '17 at 21:55
  • @Jeff Well, looking at the [source](https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/core/frame.py#L2188-L2303) it returns `self.loc[com._get_info_slice(self, dtype_indexer)]`, and there isn't any obvious place where an explicit copy is being made, although some stuff is hidden behind [helper-functions](https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/core/common.py#L148), but again, nothing obvious as far as I can tell. Are you saying since only the columns are subset, then it should reliably return a view? – juanpa.arrivillaga Sep 26 '17 at 22:11
  • yes; these grab whole blocks by dtype so they get returned directly; how exactly are you testing whether it's an actual copy? this is actually non trivial with mixed dtypes – Jeff Sep 26 '17 at 22:13
  • @Jeff [here's a dump from an IPython session](https://gist.github.com/juanarrivillaga/3413004317b2826842412d69e88c3e29) – juanpa.arrivillaga Sep 26 '17 at 22:14
  • you are misunderstanding; that actually indicates you have a view into the original frame; this is certainly a view here; look at df.c.values.base (same as original frame); – Jeff Sep 26 '17 at 22:16
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/155365/discussion-between-juanpa-arrivillaga-and-jeff). – juanpa.arrivillaga Sep 26 '17 at 22:19
  • @Jeff, eh, if it were a *view*, then wouldn't the changes in `sub` be reflected in `df`? Like if I did `b = df.loc[:,'b']` then `b.iloc[0] = False` I should see `False` in `df.loc[0, 'b']` – juanpa.arrivillaga Sep 26 '17 at 22:28
  • sure for some dtypes (boolean is not one of them); but floats sure – Jeff Sep 26 '17 at 22:30
  • @Jeff what? My point is that there is a `999.99` in `sub`, which is the object returned by `df.select_dtypes`, but none in `df`, so `sub` cannot be a view. What am I not getting? Indeed, for that `bool` column, I did see the view behavior when I do `b = df.loc[:,'b']` – juanpa.arrivillaga Sep 26 '17 at 22:31
  • .loc can create a copy,[] generally will not; this is why view semantics are tricky; .loc[:, ] is a filter even though it doesn't look like one – Jeff Sep 26 '17 at 22:53
  • @Jeff well, exactly. My point is, `sub` in the example I gave *shows copy behavior*. – juanpa.arrivillaga Sep 26 '17 at 22:54
  • well pls file an issue then; this should be a view – Jeff Sep 26 '17 at 23:03

Two ways (without using df.select_dtypes, which unnecessarily creates a temporary intermediate DataFrame):

# 1. NumPy's dtype hierarchy
import numpy as np
[c for c in df.columns if np.issubdtype(df[c].dtype, np.number)]

# 2. pandas' own dtype check (pass the column or its dtype, not the column name)
from pandas.api.types import is_numeric_dtype
[c for c in df.columns if is_numeric_dtype(df[c])]

Or, if you want the result to be a pd.Index rather than a list of column name strings as above, here are three ways (the first is from @juanpa.arrivillaga):

import numpy as np
from pandas.api.types import is_numeric_dtype

# 1. NumPy check on each dtype
df.columns[[np.issubdtype(dt, np.number) for dt in df.dtypes]]
# 2. pandas check in a list comprehension
df.columns[[is_numeric_dtype(dt) for dt in df.dtypes]]
# 3. the same pandas check via map
df.columns[list(map(is_numeric_dtype, df.dtypes))]

Be aware that the variants differ on bool columns: the np.issubdtype(..., np.number) versions do not consider a bool column numeric, whereas is_numeric_dtype does (tested with numpy 1.22.3 / pandas 1.4.2).
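
A quick way to check how each approach treats a bool column on your own versions is to run them side by side. This is only a sketch on a hypothetical frame (the column names are made up), with the string column forced to object dtype so that np.issubdtype only ever sees NumPy dtypes:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

# Hypothetical example data with float, int, bool and string columns
df = pd.DataFrame({
    "x": [1.0, 2.0],
    "n": [1, 2],
    "flag": [True, False],
    "name": pd.Series(["a", "b"], dtype=object),
})

# NumPy hierarchy check: bool is not a subtype of np.number
via_numpy = [c for c in df.columns if np.issubdtype(df[c].dtype, np.number)]

# pandas check: bool counts as numeric
via_pandas = [c for c in df.columns if is_numeric_dtype(df[c])]

# pandas check with bool filtered out explicitly
via_pandas_no_bool = [
    c for c in df.columns
    if is_numeric_dtype(df[c]) and not is_bool_dtype(df[c])
]

# The approach from the question, for comparison
via_select = df.select_dtypes(include=[np.number]).columns.tolist()

print(via_numpy)           # ['x', 'n']
print(via_pandas)          # ['x', 'n', 'flag']
print(via_pandas_no_bool)  # ['x', 'n']
print(via_select)          # ['x', 'n']
```

If bool columns should not count as numeric, either the NumPy check or the explicit is_bool_dtype filter gives the same column list as select_dtypes here.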

dabru