I would like to select specific columns from a pandas DataFrame by column index.

In particular, I would like to select the columns whose indices are generated by c(12:26,69:85,96:99,134:928,933:935,940:967) in R. How can I do that in Python?

I am thinking of something like the following, but of course Python does not have a function called c():

input2 = input2.iloc[:,c(12:26,69:85,96:99,134:928,933:935,940:967)]
smci
user5309995
  • From TFM - http://pandas.pydata.org/pandas-docs/stable/indexing.html - `You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner.` – hrbrmstr Sep 07 '15 at 17:48
  • Thanks @hrbrmstr for your prompt responses! I have read the help file in the link you posted, but still do not know how to solve my problem...I do not know how to create the list of column index fast, like in R I can use `c(12:26,69:85,96:99,134:928,933:935,940:967)`, but I do not know how to do that in Python. Thanks! – user5309995 Sep 07 '15 at 17:56
  • `list(range(12, 26) + range(69, 85) + range(96, 99) + range(134, 928) + range(933, 935) + range(940, 967))` – hrbrmstr Sep 07 '15 at 18:30
  • Do you only want the equivalent of `c()` for **(numerical) dataframe column indices,** or also for concatenating **(string) column names** ('labels' in Pandas terminology)? `pandas.loc[:, ['a','b','c']]` can handle both, whereas `numpy.r_` only works on numerical indices, not string labels – smci Nov 15 '19 at 01:11

3 Answers

The closest equivalent is NumPy's np.r_. It concatenates integer slices into a single index array, so there is no need to call range for each of them:

import numpy as np
import pandas as pd

np.r_[2:4, 7:11, 21:25]
Out: array([ 2,  3,  7,  8,  9, 10, 21, 22, 23, 24])

df = pd.DataFrame(np.random.randn(1000))
df.iloc[np.r_[2:4, 7:11, 21:25]]
Out: 
           0
2   2.720383
3   0.656391
7  -0.581855
8   0.047612
9   1.416250
10  0.206395
21 -1.519904
22  0.681153
23 -1.208401
24 -0.358545
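
Applied to the ranges in the question, this might look like the sketch below. It assumes the R indices are 1-based and end-inclusive, so each start is shifted down by one while each end stays the same (np.r_ slices are 0-based and end-exclusive). The 1000-column frame is a made-up stand-in for the asker's input2.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's input2: 3 rows, 1000 columns.
input2 = pd.DataFrame(np.zeros((3, 1000)))

# R: c(12:26, 69:85, 96:99, 134:928, 933:935, 940:967)
# Assumption: R positions are 1-based and end-inclusive, so subtract 1
# from every start and leave every end unchanged.
cols = np.r_[11:26, 68:85, 95:99, 133:928, 932:935, 939:967]

# Select those column positions.
subset = input2.iloc[:, cols]
```

If the numbers are already 0-based positions, the shift is not needed and the slices can be written as they are.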
smci
ayhan
  • Wow. Surprised this isn't voted more. While other answers might be more pythonic, it surprised me coming from `R` that `python` was sooooo verbose. This is the true analog to `c()`, though I wonder why the dunder... does that imply it's a quasi-private method? – Hendy Sep 28 '17 at 13:56
  • @Hendy Python is a general-purpose language so many of the things that R offers out of the box (let's say vector things) are provided by third party libraries in Python (such as numpy and pandas). I guess that's the reason for verbosity. – ayhan Sep 28 '17 at 20:07
  • Not really. `pandas.loc[:, ['a','b','c']]` can handle both, whereas `numpy.r_` only works on numerical indices, not string labels. I have never needed `numpy.r_`, and I've never seen it used in pandas code either. – smci Nov 15 '19 at 01:16
  • @smci You cannot pass non contiguous slices to loc or iloc without a helper like np.r_. That's the whole point of the question. – ayhan Nov 15 '19 at 22:50
  • @ayhan: yes you can, you just [use list notation on the expanded slices](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-integer) e.g. `df.iloc[[1, 3, 8, 9, 10], [1, 3]]`. I've never seen `numpy.r_` used in pandas. – smci Nov 15 '19 at 23:05
  • @smci Those are not noncontiguous slices but a list of indices. This is like saying you don't need a loop you can just write the same command 100 times. You can refer to the [docs](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html#slicing-with-r-s-c) if you haven't seen before. – ayhan Nov 15 '19 at 23:16
  • @ayhan: those are noncontiguous slices expanded into lists of indices, which is what most pandas users do. Also `r_` doesn't work on (string) labels; only works on numerical indices. `np.r_(['foo','bar')` fails with `TypeError`. – smci Nov 16 '19 at 00:00
  • @smci A slice is an object in Python which can be explicitly constructed with the slice constructor (`a = slice(3)`, `b = slice(20, 23)`) or with a special syntax in some methods like `__getitem__` or `__setitem__` (for example `np.r_[1:10:2]`). `a` and `b` are noncontiguous slices. `[0, 1, 2, 20, 21, 22]` is a list of integers. How you interpret them doesn't change their definition. `df.iloc[[a, b]]` fails but `df.iloc[np.r_[a, b]]` does not. --- String indices are irrelevant to the question (as in `c('foo':'bar')` does not compute). – ayhan Nov 16 '19 at 00:26
  • @ayhan: String indices ('labels') are entirely relevant to the question, since they're one of the three main ways pandas handles indexing. `df.loc[:, ['foo', 'bar']]` works fine. The question did not say restrict things to slices, it didn't mention 'slice' at all. Yes we both know what a slice object is. Anyway, `r_` is only a partial solution like I'm saying. – smci Nov 16 '19 at 00:47
  • @smci The question is not about the main ways pandas handles indexing. The question states its requirement through an example and that example contains slices. Strings, on the other hand, appear nowhere. It is about representing `c(12:26,69:85,96:99,134:928,933:935,940:967)` in Python. My answer is `np.r_[12:26,69:85,96:99,134:928,933:935,940:967]`. If you'd like to share your *complete solution* by typing 856 integers by hand you can click on the answer button there: ↓ – ayhan Nov 16 '19 at 15:02
  • @ayhan: the question title says nothing about (numerical) slices but the body does. The question title is much broader than the body. Hence it's ambiguous. There are multiple ways to index columns, and integer slices is only one. – smci Nov 16 '19 at 21:28

Putting @hrbrmstr's comment into an answer, because it solved my issue and I want to make it clear that this question is resolved. In addition, please note that range(a, b) gives the numbers a, a+1, ..., b-2, b-1 and does not include b.

R's combine function

c(4,12:26,69:85,96:99,134:928,933:935)

is translated into Python as

[4] + list(range(12,27)) + list(range(69,86)) + list(range(96,100)) + list(range(134,929)) + list(range(933,936))
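
If the list is meant to be used with iloc, keep in mind that R positions start at 1 while pandas positions start at 0. A small helper along these lines could do the shift and the inclusive-end handling in one place. This is only a sketch, and r_positions is a hypothetical name, not part of pandas or NumPy:

```python
def r_positions(*parts):
    """Expand R-style pieces into 0-based positions.

    Each piece is either a single 1-based index or a (start, end) tuple
    that is 1-based and end-inclusive, like start:end in R.
    """
    positions = []
    for part in parts:
        if isinstance(part, tuple):
            start, end = part
            # 1-based inclusive start..end -> 0-based range(start - 1, end)
            positions.extend(range(start - 1, end))
        else:
            positions.append(part - 1)
    return positions


# R: c(4, 12:26, 69:85, 96:99, 134:928, 933:935)
idx = r_positions(4, (12, 26), (69, 85), (96, 99), (134, 928), (933, 935))
```

The result is a plain list of integers, so it could be passed directly as the column selector in iloc.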
tshynik

To answer the actual question,

Python equivalent of R c() function, for dataframe column indices?

I'm using this definition of c():

c = lambda v: v.split(',') if ":" not in v else eval(f'np.r_[{v}]')

Then we can do things like:

df = pd.DataFrame({'x': np.random.randn(1000),
                   'y': np.random.randn(1000)})
# row selection
df.iloc[c('2:4,7:11,21:25')] 

# columns by name
df[c('x,y')] 

# columns by range
df.T[c('12:15,17:25,500:750')]

That's pretty much as close as it gets in terms of R-like syntax.
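
If calling eval on a string is a concern, the same idea can be sketched without it. The version below is only an illustration under a few assumptions: it handles plain start:stop pieces and single integers, not steps or open-ended slices, and the name c simply mirrors the lambda above.

```python
import numpy as np


def c(spec):
    """Parse 'a:b,c:d' into an array of positions, or 'x,y' into a list of labels."""
    parts = [p.strip() for p in spec.split(",")]
    if not any(":" in p for p in parts):
        # No slices at all: treat the pieces as column labels.
        return parts
    pieces = []
    for p in parts:
        if ":" in p:
            start, stop = (int(x) for x in p.split(":"))
            pieces.append(np.arange(start, stop))  # end-exclusive, like np.r_
        else:
            pieces.append(np.array([int(p)]))
    return np.concatenate(pieces)


rows = c("2:4,7:11,21:25")
names = c("x,y")
```

Here rows can be used with iloc and names with plain column selection, in the same way as the eval-based version.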

To the curious mind

Note there is a performance penalty in using c() as defined above vs. np.r_. To paraphrase Knuth, let's not optimize prematurely ;-)

%timeit np.r_[2:4, 7:11, 21:25]
27.3 µs ± 786 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit c("2:4, 7:11, 21:25")
53.7 µs ± 977 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
miraculixx