20

Not sure how else to ask this but, I want to search for a term within several string elements. Here's what my code looks like (but wrong):

inplay = vector(length=nrow(des))
for (ii in 1:nrow(des)) {
 if (des[ii] = 'In play%')
  inplay[ii] = 1
 else inplay[ii] = 0
}

des is a vector that stores strings such as "Swinging Strike", "In play (run(s))", "In play (out(s) recorded)" and etc. What I want inplay to store is a 1s and 0s vector corresponding with the des vector, with the 1s in inplay indicating that the des value had "In play%" in it and 0s otherwise.

I believe the 3rd line is incorrect, because all this does is return a vector of 0s with a 1 in the last element.

Thanks in advance!

Maël
  • 45,206
  • 3
  • 29
  • 67
Albert Lyu
  • 383
  • 2
  • 6
  • 10

4 Answers4

26

The data.table package has syntax that is often similar to SQL. The package includes %like%, which is a "convenience function for calling regexpr". Here is an example taken from its help file:

## Create the data.table:
DT = data.table(Name=c("Mary","George","Martha"), Salary=c(2,3,4))

## Subset the DT table where the Name column is like "Mar%":
DT[Name %like% "^Mar"]
##      Name Salary
## 1:   Mary      2
## 2: Martha      4
dnlbrky
  • 9,396
  • 2
  • 51
  • 64
19

The R analog to SQL's LIKE is just R's ordinary indexing syntax.

The 'LIKE' operator selects data rows from a table by matching string values in a specified column against a user-supplied pattern

> # create a data frame having a character column
> clrs = c("blue", "black", "brown", "beige", "berry", "bronze", "blue-green", "blueberry")
> dfx = data.frame(Velocity=sample(100, 8), Colors=clrs)
> dfx
            Velocity    Colors
        1       90       blue
        2       94      black
        3       71      brown
        4       36      beige
        5       75      berry
        6        2     bronze
        7       89    blue-green
        8       93    blueberry

> # create a pattern to use (the same as you would do when using the LIKE operator)
> ptn = '^be.*?'  # gets beige and berry but not blueberry
> # execute a pattern-matching function on your data to create an index vector
> ndx = grep(ptn, dfx$Colors, perl=T)
> # use this index vector to extract the rows you want from the data frome:
> selected_rows = dfx[ndx,]
> selected_rows
   Velocity Colors
     4       36  beige
     5       75  berry 

In SQL, that would be:

SELECT * FROM dfx WHERE Colors LIKE ptn3
doug
  • 69,080
  • 24
  • 165
  • 199
  • Why do you wrap `grep()` with `fnx`? – Vince Aug 22 '10 at 03:54
  • i originally had in mind a function that took the data frame as an argument, rather than just a 1D vector. In any event, edited to remove function wrapper. – doug Aug 22 '10 at 04:29
  • This works, but suppose I want to store a 1 in vector inplay if the row number is contained in ndx, and a 0 otherwise, where the length of vector inplay is the length of dfx. How do I go about this? I'm playing with IF and ELSE statements but I'm not getting this to work. Thanks in advance! – Albert Lyu Aug 22 '10 at 11:39
  • I figured it out. Suppose SL is the matrix that stores "Swinging Strike", "In play(run(s))" etc. ndx = grep('^In play.*?', SL$des, perl=T) SL <- transform(SL, inplay=mat.or.vec(1,nrow(SL))[des]) SL$inplay <- replace(SL$inplay, ndx, 1) Thanks! – Albert Lyu Aug 22 '10 at 12:00
  • just saw your comment--so you solved your question above "[s]uppose i want to store...."? – doug Aug 22 '10 at 19:40
3

Something like regexpr?

> d <- c("Swinging Strike", "In play (run(s))", "In play (out(s) recorded)")
> regexpr('In play', d)
[1] -1  1  1
attr(,"match.length")
[1] -1  7  7
> 

or grep

> grep('In play', d)
[1] 2 3
> 
Vince
  • 7,608
  • 3
  • 41
  • 46
0

Since stringr 1.5.0, you can use str_like, which follows the structure of SQL's LIKE:

library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")
str_like(fruit, "app%")
#[1]  TRUE FALSE FALSE FALSE

Not only does it include %, but also several other operators (see ?str_like).

  • Must match the entire string

  • _⁠ matches a single character (like .)

  • ⁠%⁠ matches any number of characters (like ⁠.*⁠)

  • ⁠%⁠ and ⁠_⁠ match literal ⁠%⁠ and ⁠_⁠

  • The match is case insensitive by default

Maël
  • 45,206
  • 3
  • 29
  • 67