Subset data to contain only columns whose names match a condition

Question

Is there a way for me to subset data based on column names starting with a particular string? I have some columns which are like ABC_1 ABC_2 ABC_3 and some like XYZ_1, XYZ_2,XYZ_3 let's say.

How can I subset my df based only on columns containing the above portions of text (lets say, ABC or XYZ)? I can use indices, but the columns are too scattered in data and it becomes too much of hard coding.

Also, I want to only include rows from each of these columns where any of their value is >0 so if either of the 6 columns above has a 1 in the row, it makes a cut into my final data frame.

What language?! `R`? Add the relevant tag or no-one following R will see it. Please give a reproducible example, i.e. the structure of your dataframe, preferably `dput( head( df ) )`, or at the very least, `str( df )`. — Simon O'Hanlon, Sep 03 '13 at 08:24

score 119 · Accepted Answer · edited Jun 20 '20 at 09:12

Try grepl on the names of your data.frame. grepl matches a regular expression to a target and returns TRUE if a match is found and FALSE otherwise. The function is vectorised so you can pass a vector of strings to match and you will get a vector of boolean values returned.

Example

#  Data
df <- data.frame( ABC_1 = runif(3),
            ABC_2 = runif(3),
            XYZ_1 = runif(3),
            XYZ_2 = runif(3) )

#      ABC_1     ABC_2     XYZ_1     XYZ_2
#1 0.3792645 0.3614199 0.9793573 0.7139381
#2 0.1313246 0.9746691 0.7276705 0.0126057
#3 0.7282680 0.6518444 0.9531389 0.9673290

#  Use grepl
df[ , grepl( "ABC" , names( df ) ) ]
#      ABC_1     ABC_2
#1 0.3792645 0.3614199
#2 0.1313246 0.9746691
#3 0.7282680 0.6518444

#  grepl returns logical vector like this which is what we use to subset columns
grepl( "ABC" , names( df ) )
#[1]  TRUE  TRUE FALSE FALSE

To answer the second part, I'd make the subset data.frame and then make a vector that indexes the rows to keep (a logical vector) like this...

set.seed(1)
df <- data.frame( ABC_1 = sample(0:1,3,repl = TRUE),
            ABC_2 = sample(0:1,3,repl = TRUE),
            XYZ_1 = sample(0:1,3,repl = TRUE),
            XYZ_2 = sample(0:1,3,repl = TRUE) )

# We will want to discard the second row because 'all' ABC values are 0:
#  ABC_1 ABC_2 XYZ_1 XYZ_2
#1     0     1     1     0
#2     0     0     1     0
#3     1     1     1     0


df1 <- df[ , grepl( "ABC" , names( df ) ) ]

ind <- apply( df1 , 1 , function(x) any( x > 0 ) )

df1[ ind , ]
#  ABC_1 ABC_2
#1     0     1
#3     1     1

@kaos1511 it is a `regular expression` so yes. Read `?regexpr` To get `XYZ` as well chain them together in the expresison like this... `grepl( "ABC|XYZ" , names( df ) )` — Simon O'Hanlon, Sep 03 '13 at 08:36
Thanks simon, how do i also ensure i only take rows which are >0 value for all columns that match the string i specify for column names?..will i need to see the dataframe that results and then specify manually?. If i see that there are 6 columns that match, should i say something like test<-df[ , grepl( "ABC|XYZ" , names( df ) ) ] and then test<-test[c(1:6)>0], will this work? — , Sep 03 '13 at 08:46
@kaos1511 that is a different question! It is bad form to keep asking question in the comments after the OP has been answered and it frustrates those who answer. Please ask a new question so everyone will see it. Thanks. — Simon O'Hanlon, Sep 03 '13 at 08:52
.@SimonO'Hanlon - I am using `grepl()` as `data <- data [, !grepl("Unique-",names(data))]` to remove column where column name starts with `Unique-`. I see that `grepl()` is appending duplicate column name with `.1`,`.2`,`.3` etc. Is it possible to use `grepl()` without affecting column name even if there are duplicates in the data frame? — Chetan Arvind Patil, Jan 22 '19 at 18:53

score 44 · Answer 2 · answered Feb 28 '17 at 12:52

44

You can also use starts_with and dplyr's select() like so:

df <- df %>% dplyr:: select(starts_with("ABC"))

answered Feb 28 '17 at 12:52

basbabybel

780
8
17

score 19 · Answer 3 · answered Feb 04 '19 at 02:47

19

Just in case for data.table users, the following works for me:

df[, grep("ABC", names(df)), with = FALSE]

answered Feb 04 '19 at 02:47

Alex Ho

418
1
4
8

score 14 · Answer 4 · edited Jun 26 '17 at 04:30

14

Using dplyr you can:

df <- df %>% dplyr:: select(grep("ABC", names(df)), grep("XYZ", names(df)))

edited Jun 26 '17 at 04:30

Ronak Shah

377,200
20
156
213

answered Oct 13 '16 at 20:40

Ram B

157
1
9

score 14 · Answer 5 · edited Feb 22 '22 at 05:09

14

Simplest solution, given to me by my statistics professor:

df[,grep("pattern", colnames(df))]

That's it. It doesn't give you booleans or anything, it just gives you your dataset that follows that pattern.

edited Feb 22 '22 at 05:09

AndrewGB

16,126
5
18
49

answered Mar 23 '21 at 20:54

Miranda

141
1
2

2

This should be the top answer! – pascal May 18 '21 at 12:58

score 5 · Answer 6 · answered Mar 01 '18 at 15:51

5

This worked for me:

df[,names(df) %in% colnames(df)[grepl(str,colnames(df))]]

answered Mar 01 '18 at 15:51

guero64

1,019
1
12
18

score 3 · Answer 7 · edited May 10 '22 at 12:09

3

Try this (here, looking for variables whose name contains 'date', including all case combinations):

df %>%  dplyr::select(contains("date", ignore.case = TRUE))

edited May 10 '22 at 12:09

ah bon

9,293
12
65
148

answered May 11 '21 at 21:01

Barry DeCicco

251
1
7

score 0 · Answer 8 · edited Sep 27 '21 at 00:15

0

Building off the above, I think it is the most flexible. Note that you'll need to use dplyr, but that's not a terrible thing.

Advantage: You can search for more than "contains". Here, I use "starts_with" for a relatively common string "ST". Using "grep" here could easily have driven you mad; mad, I say!

library(dplyr)

df %>% dplyr::select(starts_with("ST",ignore.case = TRUE))

edited Sep 27 '21 at 00:15

Vinícius Félix

8,448
6
16
32

answered Sep 26 '21 at 21:10

Doch

3
2

score 0 · Answer 9 · answered Apr 21 '22 at 19:20

Many of the tidyselect options have been mentioned already. contains and starts_with work very well with this specific problem. For more complicated conditions/matching you can use matches, which will select columns using a regular expression match:

library(dplyr)

df %>% 
  select(matches("^ABC")) # starts with "ABC"

# case insensitive match
df %>% 
  select(matches("(?i)^abc")) # starts with "ABC", "Abc", "abc", etc.

score 0 · Answer 10 · answered May 10 '22 at 12:14

I'll provide alternative solutions to subset one or multiple columns, based on @Simon O'Hanlon's sample data:

> library(tidyverse)
> library(dplyr)
> 
> df <- data.frame(D=runif(3),
+   ABC_1 = runif(3),
+   ABC_2 = runif(3),
+   XYZ_1 = runif(3),
+   XYZ_2 = runif(3)
+   )
> 
> df %>% 
+   dplyr::select(matches('ABC'))
       ABC_1      ABC_2
1 0.06445754 0.16957677
2 0.75470562 0.06221405
3 0.62041003 0.10902927
> 
> df %>% 
+   dplyr::select(matches('ABC|XYZ'))
       ABC_1      ABC_2     XYZ_1     XYZ_2
1 0.06445754 0.16957677 0.3817164 0.1922095
2 0.75470562 0.06221405 0.1693109 0.2571700
3 0.62041003 0.10902927 0.2986525 0.1812318

Subset data to contain only columns whose names match a condition

10 Answers10

Example

Linked

Related