How to select range of columns in a dataframe based on their name and not their indexes?

Question

In a pandas dataframe created like this:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(10, size=(6, 6)),
                  columns=['c' + str(i) for i in range(6)],
                  index=["r" + str(i) for i in range(6)])

which could look as follows:

    c0  c1  c2  c3  c4  c5
r0   2   7   3   3   2   8
r1   6   9   6   7   9   1
r2   4   0   9   8   4   2
r3   9   0   4   3   5   4
r4   7   6   8   8   0   8
r5   0   6   1   8   2   2

I can easily select certain rows and/or a range of columns using .loc:

print df.loc[['r1', 'r5'], 'c1':'c4']

That would return:

    c1  c2  c3  c4
r1   9   6   7   9
r5   6   1   8   2

So, particular rows/columns I can select in a list, a range of rows/columns using a colon.

How would one do this in R? Here and here one always has to specify the desired range of columns by their index but one cannot - or at least I did not find it - access those by name. To give an example:

df <- data.frame(c1=1:6, c2=2:7, c3=3:8, c4=4:9, c5=5:10, c6=6:11)
rownames(df) <- c('r1', 'r2', 'r3', 'r4', 'r5', 'r6')

The command

df[c('r1', 'r5'),'c1':'c4']

does not work and throws an error. The only thing that worked for me is

df[c('r1', 'r5'), 1:4]

which returns

   c1 c2 c3 c4
r1  1  2  3  4
r5  5  6  7  8

But how would I select the columns by their name and not by their index (which might be important when I drop certain columns throughout the analysis)? In this particular case I could of course use grep but how about columns that have arbitrary names?

So I don't want to use

df[c('r1', 'r5'),c('c1','c2', 'c3', 'c4')]

but an actual slice.

EDIT:

A follow-up question can be found here.

@downvoter: Could you please explain why you downvoted? It is a clear question with a minimal reproducible example, so what is the problem? — Cleb, Jun 08 '16 at 23:16
Feels a little like moving the goalposts to ask a question specifically about columns and then edit it to include rows after an answer is submitted. Might be better to roll back the edits and ask a new question. They seem very related, but column names and row names are treated quite differently in R data frames. (Though not my downvote - and not sure if this is the reason.) — Gregor Thomas, Jun 08 '16 at 23:19
@Gregor: Ok, I am not that familiar with the R specifics, in Pandas it is straight forward, I can just do: `df.loc['r1':'r3', 'c1':'c4']` so rows and columns are not treated that differently. I will of course accept the provided answer (see my first comment below it) if nothing better shows up because it indeed answered the original question. And yes, you are right, I should have put the row selection in the original question, so I understand your downvote; thanks for explaining it! — Cleb, Jun 08 '16 at 23:24
Even from the start, I think it would work better as a separate question (but still no downvotes from me). I can think of three good methods for columns: `base::subset` as in the answer or `dplyr::select` or `data.table`. I don't know of any methods for row names other than hacking one. But it would make a good separate question and maybe garner some specific interest/innovative solutions. — Gregor Thomas, Jun 08 '16 at 23:27
@Gregor: Ok, thanks. As mentioned, I had no idea that row and column selection are treated that differently, I edit my question again; thanks for making me aware of that. — Cleb, Jun 08 '16 at 23:30
@Cleb, downvotes. likely because you have python and pandas tag, and the question is about "R". — Merlin, Jun 08 '16 at 23:36
@Merlin: That was to attract the pandas experts that might know how to do this in R. Wrong tags would be strange reasons for a downvote in my opinion but well,... thanks for editing. — Cleb, Jun 08 '16 at 23:38

evan.oman · Accepted Answer · 2016-06-08T23:14:18.223

14

It looks like you can accomplish this with a subset:

> df <- data.frame(c1=1:6, c2=2:7, c3=3:8, c4=4:9, c5=5:10, c6=6:11)
> rownames(df) <- c('r1', 'r2', 'r3', 'r4', 'r5', 'r6')
> subset(df, select=c1:c4)
   c1 c2 c3 c4
r1  1  2  3  4
r2  2  3  4  5
r3  3  4  5  6
r4  4  5  6  7
r5  5  6  7  8
r6  6  7  8  9
> subset(df, select=c1:c2)
   c1 c2
r1  1  2
r2  2  3
r3  3  4
r4  4  5
r5  5  6
r6  6  7

If you want to subset by row name range, this hack would do:

> gRI <- function(df, rName) {which(match(rNames, rName) == 1)}
> df[gRI(df,"r2"):gRI(df,"r4"),]
   c1 c2 c3 c4 c5 c6
r2  2  3  4  5  6  7
r3  3  4  5  6  7  8
r4  4  5  6  7  8  9

edited Jun 08 '16 at 23:14

answered Jun 08 '16 at 22:51

evan.oman

5,922
22
43

That indeed works. How would you now also select rows at the same time? If you want specific rows then `subset(df[c('r1', 'r3'),], select=c1:c4)` would work but how about a range of rows (see my edit)? Upvote for now and might accept it later on depending on other answers' quality... – Cleb Jun 08 '16 at 22:56
2

I think standard practice would be to not name your rows and then use the standard index ranges to subset the rows. If you need row names you could always add them as an id column. – evan.oman Jun 08 '16 at 23:06
That might be a nice work-around. But still feels strange that this shouldn't be possible. – Cleb Jun 08 '16 at 23:11
See my recent edit for a hack to do row name range subsetting. Most `R` that I see does `df[beginInd:endInd,]` type row subsetting – evan.oman Jun 08 '16 at 23:15
Yes, subsetting by index seems to be more common but I am still surprised that there is no built-in for that. – Cleb Jun 08 '16 at 23:18

score 2 · Answer 2 · answered Jun 08 '16 at 23:26

2

An alternative approach to subset if you don't mind to work with data.table would be:

data.table::setDT(df)
df[1:3, c2:c4, with=F]
   c2 c3 c4
1:  2  3  4
2:  3  4  5
3:  4  5  6

This still does not solve the problem of subsetting row range though.

answered Jun 08 '16 at 23:26

Psidom

209,562
33
339
356

Ok, thanks for the alternative (upvoted). The row selection was not part of the original question; I did not expect it to be that different from the column selection so that would rather be a bonus ;) – Cleb Jun 08 '16 at 23:28

score 2 · Answer 3 · answered Jun 08 '16 at 23:37

A solution using dplyr package but you need to specify the row you want to select before hand

rowName2Match <- c("r1", "r5")

df1 <- df %>% 
  select(matches("2"):matches("4")) %>% 
  add_rownames() %>% 
  mutate(idRow = match(rowname, rowName2Match)) %>% 
  slice(which(!is.na(idRow))) %>% 
  select(-idRow)
df1

> df1
Source: local data frame [2 x 4]

  rowname    c2    c3    c4
   <chr> <int> <int> <int>
1      r1     2     3     4
2      r5     6     7     8

Thanks for the alternative solution but seems far more complicated than @evan058's solution. — Cleb, Jun 08 '16 at 23:42

score 1 · Answer 4 · answered Jun 08 '16 at 23:11

1

Adding onto @evan058's answer:

subset(df[rownames(df) %in% c("r3", "r4", "r5"),], select=c1:c4)

c1 c2 c3 c4
r3  3  4  5  6
r4  4  5  6  7
r5  5  6  7  8

But note, the : operator will probably not work here; you will have to write out the name of each row you want to include explicitly. It might be easier to group by a particular value of one of your other columns or to create an index column as @evan058 mentioned in comments.

answered Jun 08 '16 at 23:11

Bryan Goggin

2,449
15
17

Thanks but then `subset(df[c('r1', 'r3'),], select=c1:c4)` seems more convenient. But I actually want to avoid to specify the row names. Upvoted anyway ;) – Cleb Jun 08 '16 at 23:14

score -1 · Answer 5 · answered Jun 08 '16 at 23:23

-1

This seems way too easy so perhaps I'm doing something wrong.

df <- data.frame(c1=1:6, c2=2:7, c3=3:8, c4=4:9, c5=5:10, c6=6:11,
                 row.names=c('r1', 'r2', 'r3', 'r4', 'r5', 'r6'))


df[c('r1','r2'),c('c1','c2')]

   c1 c2
r1  1  2
r2  2  3

answered Jun 08 '16 at 23:23

Jason

1,559
1
9
14

1

That requires specifying the rows and columns which I want to avoid. PS: I did not downvote – Cleb Jun 08 '16 at 23:25

How to select range of columns in a dataframe based on their name and not their indexes?

5 Answers5

Linked

Related