
I have a data frame in the format shown below:

                  String  Keyword                           
1 Apples bananas mangoes   mangoes                    
2 Apples bananas mangoes   bananas                    
3 Apples bananas mangoes   peach   
.....  

It's a large data frame (50,000+ rows). I'm currently applying nested ifelse statements manually, in batches.

data$Result <- ifelse(grepl("apples", data$String, ignore.case = TRUE), "apples",
               ifelse(grepl("bananas", data$String, ignore.case = TRUE), "bananas",
               ifelse(grepl("mangoes", data$String, ignore.case = TRUE), "mangoes", "unavailable")))


                String    Keyword Result
Apples bananas mangoes    mangoes mangoes  
Apples bananas mangoes    bananas bananas  
Apples bananas mangoes    peach   unavailable

Is there a way to store String and Keyword in a list and then apply grepl to the entire list?

David Arenburg
red
    Please provide some example data and what the output should look like. If you have structures ready, you can use `dput`. See [this page](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more information on how to supplement your question with some dazzling fake data. – Roman Luštrik Jul 15 '15 at 13:09

3 Answers


I'm assuming this is what you want:

df <- data.frame(string=rep("Apples bananas mangoes",3), keyword=c("mangoes", "bananas", "peach"))

df$result <- ifelse(mapply(grepl,df$keyword, df$string), as.character(df$keyword), "Unavailable")

                 string keyword      result
1 Apples bananas mangoes mangoes     mangoes
2 Apples bananas mangoes bananas     bananas
3 Apples bananas mangoes   peach Unavailable
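
The `mapply` call above pairs each keyword with the string on the same row. Here is a minimal sketch of the same idea, assuming the keywords are literal text rather than regular expressions and that matching should ignore case (both are assumptions, not something the question states). `match_keyword` and its `no_match` argument are hypothetical names used only for illustration:

```r
# Hypothetical helper: row-wise, case-insensitive, literal keyword lookup.
match_keyword <- function(string, keyword, no_match = "Unavailable") {
  keyword <- as.character(keyword)  # guard against factor columns
  string  <- as.character(string)
  # fixed = TRUE treats each keyword as literal text; lower-casing both sides
  # makes the comparison case-insensitive.
  found <- mapply(grepl, tolower(keyword), tolower(string),
                  MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
  ifelse(found, keyword, no_match)
}

df <- data.frame(string  = rep("Apples bananas mangoes", 3),
                 keyword = c("Mangoes", "bananas", "peach"),
                 stringsAsFactors = FALSE)
df$result <- match_keyword(df$string, df$keyword)
```

This still calls `grepl` once per row, so it is likely to be slow on large data; it only makes the row-wise pairing explicit.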

Update

Based on the comment, it sounds like you have a list of words that you want to check against the keyword. If that is the case, something like this might work:

#Set up toy dataset
set.seed(123)
df <- data.frame(Keyword = sample(c("mangoes", "bananas", "apples","lemons" , "peach"), 10, replace = TRUE))
df

#Choose your searchwords globally
searchwords <- c("apples", "bananas", "mangoes")

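library(data.table)
setDT(df)
# Copy the keyword into result wherever it equals one of the search words
for (x in searchwords) df[Keyword == x, result := Keyword]
# Rows that matched none of the search words are still NA
df[is.na(result), result := "Unavailable"]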
df

     Keyword      result
 1: bananas     bananas
 2:  lemons Unavailable
 3:  apples      apples
 4:   peach Unavailable
 5:   peach Unavailable
 6: mangoes     mangoes
 7:  apples      apples
 8:   peach Unavailable
 9:  apples      apples
10:  apples      apples
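
If the goal is only to check each keyword against a fixed vocabulary, the loop could probably be replaced by a single vectorised membership test. A sketch in base R, assuming exact, case-sensitive matches are wanted and that `Keyword` is a character column (the four-row data frame is made up for illustration):

```r
searchwords <- c("apples", "bananas", "mangoes")
df <- data.frame(Keyword = c("bananas", "lemons", "apples", "peach"),
                 stringsAsFactors = FALSE)

# %in% tests every keyword against the whole vocabulary in one call
df$result <- ifelse(df$Keyword %in% searchwords, df$Keyword, "Unavailable")
```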
Serban Tanasa
  • Obviously, data.table or dplyr solutions will be faster. – Serban Tanasa Jul 15 '15 at 14:48
  • Thanks for the help @Serban. I suppose I haven't framed my question correctly. The column **keyword** is actually arranged randomly. So the **result** column should return the keyword only if an exact match is found anywhere in column **keyword**; otherwise it should return Unavailable. I'm quite new to R and programming in general, so I apologize if I'm not framing the question correctly. – red Jul 16 '15 at 07:00

Here's a simple and efficient solution with a combination of data.table and the stringi package:

library(data.table)
library(stringi)
setDT(df)[stri_detect_fixed(String, Keyword, case_insensitive = TRUE), result := Keyword]
#                    String Keyword  result
# 1: Apples bananas mangoes mangoes mangoes
# 2: Apples bananas mangoes bananas bananas
# 3: Apples bananas mangoes   peach      NA

Alternatively, a data.table-only version:

library(data.table)
setDT(df)[, result := Keyword[grep(Keyword, String, ignore.case = TRUE)], by = .(Keyword, String)]

Benchmark

Here's a benchmark on a 5e5-row data set against the mapply answer. (The for loop answer hasn't finished running yet):

set.seed(123)
df1 <- data.frame(String = rep('Apples bananas mangoes', 5e5),
                  Keyword = sample(c("mangoes", "bananas", "peach"), 5e5, replace = TRUE))


system.time(df1$result2 <- ifelse(mapply(grepl,df1$Keyword, df1$String, ignore.case = TRUE), as.character(df1$Keyword), "Unavailable"))
# user  system elapsed 
# 40.78    0.02   41.12 
system.time(setDT(df1)[stri_detect_fixed(String, Keyword, case_insensitive = TRUE), result3 := Keyword])
# user  system elapsed 
# 0.52    0.01    0.53 
David Arenburg
  • Except it returns NA instead of Unavailable, heh. – Serban Tanasa Jul 15 '15 at 14:46
  • @SerbanTanasa that is fixable with just `df1[is.na(result), result := "Unavailable"]`, which takes `.02` of a second on my system. That raises the overall running time from `.52` to `.54`, which is still about 75 times faster than your solution (and that is on this tiny data set; the gap will grow on bigger data). I left it as `NA` because `NA` is easier to manipulate and work with, thanks to functions such as `is.na` and `na.omit`. – David Arenburg Jul 15 '15 at 14:54
  • I'm a huge data.table fan. Couldn't get grep/grepl to work in it, nice to see your example. – Serban Tanasa Jul 15 '15 at 15:38
  • OK, I added a `data.table`-only version, which could be slower when there are many unique `Keyword/String` combinations. – David Arenburg Jul 15 '15 at 15:53
  • Interesting, in my benchmarks, the data.table-only solution is significantly faster. But yeah, that could change with increasing nrs. of keyword/string combos. – Serban Tanasa Jul 15 '15 at 16:29
  • It will change significantly if the number of unique combinations is very large – David Arenburg Jul 15 '15 at 17:54
  • Thanks a lot for correctly formatting the question @DavidArenburg and for the reply. I suppose I haven't framed my question correctly. The column **keyword** is actually arranged randomly. So the **result** column should return the keyword only if an exact match is found anywhere in column **keyword**; otherwise it should return Unavailable. I'm quite new to R and programming in general, so I apologize if I'm not framing the question correctly. – red Jul 16 '15 at 07:03
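
The `NA` replacement discussed in the comments above can be sketched as a second update by reference. This assumes the data.table and stringi packages are installed and uses the three-row example from the question, built with character columns:

```r
library(data.table)
library(stringi)

df <- data.table(String  = rep("Apples bananas mangoes", 3),
                 Keyword = c("mangoes", "bananas", "peach"))

# Rows where the keyword occurs in the string get the keyword; the rest stay NA
df[stri_detect_fixed(String, Keyword, case_insensitive = TRUE), result := Keyword]

# Second pass: relabel the remaining NA rows
df[is.na(result), result := "Unavailable"]
```

Keeping the two steps separate leaves the option of stopping after the first one when `NA` is the more convenient marker for missing matches.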

Here is a version using 'dplyr' and 'stringr':

library(dplyr)
library(stringr)
df <- mutate(df, result = ifelse(str_detect(string, keyword),
  keyword, "Unavailable"))

Here is the line I used to create the play data:

df <- data.frame(string = rep("Apples bananas mangoes", 3), keyword = c("mangoes", "bananas", "peaches"), stringsAsFactors=FALSE)

And here is the output I get:

                  string keyword      result
1 Apples bananas mangoes mangoes     mangoes
2 Apples bananas mangoes bananas     bananas
3 Apples bananas mangoes peaches Unavailable
ulfelder
  • I wonder if you tried benchmarking this. I'm trying it on a 5e5-row data set and it has been running for over 5 minutes now – David Arenburg Jul 15 '15 at 14:33
  • No, I did not. It runs quickly on the (tiny) play data set, but it sounds like it runs into the usual disadvantages of `for()` on large data sets. – ulfelder Jul 15 '15 at 14:37