I am trying to subset a large data frame to my columns of interest. I select them with grep, but the pattern matches one column too many ("has_socio"), which I would like to remove.

The following code does exactly what I want, but I find it unpleasant to look at, and I would like to do it in one line. Aside from just nesting the first subset call inside the second, can it be simplified?

library(foreign)  # provides read.dta
DF <- read.dta("./big.dta")

DF0 <- na.omit(subset(DF, select=c(other_named_vars, grep("has_",names(DF)))))
DF0 <- na.omit(subset(DF0, select=-c(has_socio)))

I know similar questions have been asked (e.g. Subsetting a dataframe in R by multiple conditions), but I did not find one that addresses this issue specifically. I recognize I could just write the grep regular expression more carefully, but I feel the above code expresses my intent more clearly.

Thanks.

rjturn

2 Answers

Replace your grep with:

vec <- c("blah", "has_bacon", "has_ham", "has_socio")
grep("^has_(?!socio$)", vec, value = TRUE, perl = TRUE)
# [1] "has_bacon" "has_ham"  

(?!...) is a negative lookahead: it asserts that its contents do not immediately follow the part of the pattern matched so far (here, has_). Because of the $ inside the lookahead, only the exact name has_socio is rejected. Lookaheads require perl=TRUE.
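
To see what the `$` inside the lookahead changes, here is a minimal sketch on a made-up vector. The names `has_salt` and `has_sociology` are hypothetical and added only for illustration. It also runs a plain character class such as `has_[^s]` for comparison.

```r
# Hypothetical column names, for illustration only
vec <- c("blah", "has_bacon", "has_ham", "has_salt", "has_socio", "has_sociology")

# Negative lookahead with $: drops only the exact name "has_socio"
lookahead <- grep("^has_(?!socio$)", vec, value = TRUE, perl = TRUE)

# Lookahead without $: drops every name that starts with "has_socio"
no_anchor <- grep("^has_(?!socio)", vec, value = TRUE, perl = TRUE)

# Character class: drops every name with an "s" right after "has_"
char_class <- grep("^has_[^s]", vec, value = TRUE)

print(lookahead)
print(no_anchor)
print(char_class)
```

The three patterns agree on the original example but can diverge as soon as another column name begins with `has_s`, so which one fits depends on the actual column names.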

BrodieG
  • It seems the correct way is indeed to make my RE more specific. I used standard (not perl) RE syntax, which is a bit shorter: `grep("has_[^s]")`. – rjturn Feb 20 '14 at 03:34
setdiff(grep("has_", vec, value = TRUE), "has_socio")
## [1] "has_bacon" "has_ham"  
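
Applied to the original question, the same idea might look like the sketch below. The toy data frame and its column names are invented for illustration, and `other_named_vars` is assumed here to be a character vector of column names, which the question does not confirm.

```r
# Toy data frame standing in for DF; all column names are hypothetical
DF <- data.frame(
  id        = 1:3,
  age       = c(30, NA, 45),
  has_bacon = c(1, 0, 1),
  has_ham   = c(0, 1, 1),
  has_socio = c(NA, 1, 0)
)

# Assumption: the other columns of interest are given as a character vector
other_named_vars <- c("id", "age")

# Select the has_ columns by name, drop has_socio, then remove incomplete rows
keep <- c(other_named_vars, setdiff(grep("has_", names(DF), value = TRUE), "has_socio"))
DF0 <- na.omit(DF[, keep])

print(DF0)
```

One difference from the two-step version in the question: there, the first na.omit runs while has_socio is still selected, so rows with NA in has_socio are dropped too. Here has_socio is excluded before na.omit, so such rows are kept.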
Jake Burkhead