I have two data frames in R: one is a list of names and the other is a word dictionary. If any part of a name matches a word in the dictionary, the name should be replaced by NA; otherwise the name should be returned unchanged.

Names - Dataframe

Name
Louis
Messi
duplessis
Jegan
Praveen

word dictionary - Dataframe

Dictionary
vee
sis

Expected Output

Name        Processed_Name
Louis       Louis
Messi       Messi
duplessis   NA
Jegan       Jegan
Praveen     NA
Praveen Kumar
    What have you tried? Here is [a good start to test with grepl](https://stackoverflow.com/questions/10128617/test-if-characters-in-string-in-r), and see [here](https://stackoverflow.com/questions/30180281/how-can-i-check-if-multiple-strings-exist-in-another-string) – zx8754 Jan 16 '18 at 11:27

1 Answer

library(data.table) # needed library

# create data
dt <- data.table("Name"=c("Louis",
                          "Messi",
                          "duplessis",
                          "Jegan",
                          "Praveen"))
dict <- c("vee","sis")

# collapse the dictionary words into one regex alternation, here "vee|sis"
dict_2 <- paste0(dict, collapse = "|")
# desired output: NA where Name matches any dictionary word, else Name
dt[, processed_Name := ifelse(Name %like% dict_2, NA, Name)]

OUTPUT

        Name processed_Name
1:     Louis          Louis
2:     Messi          Messi
3: duplessis             NA
4:     Jegan          Jegan
5:   Praveen             NA
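
Note that `%like%` treats `dict_2` as a regular expression, so a dictionary word containing characters such as `.` or `+` would not be matched literally. A minimal base R sketch of a literal-substring variant follows; `contains_any` and `name_vec` are hypothetical names introduced here for illustration, not part of the original answer.

```r
# Hypothetical helper: TRUE where an element of x contains at least one
# of `words` as a literal substring (fixed = TRUE disables regex matching).
contains_any <- function(x, words) {
  hit <- rep(FALSE, length(x))
  for (w in words) {
    hit <- hit | grepl(w, x, fixed = TRUE)
  }
  hit
}

name_vec <- c("Louis", "Messi", "duplessis", "Jegan", "Praveen")
dict <- c("vee", "sis")

# NA where any dictionary word occurs in the name, else the name itself
processed <- ifelse(contains_any(name_vec, dict), NA_character_, name_vec)
print(processed)
```

This loops over the dictionary once per word, so it may be slower than a single combined pattern for large dictionaries; it trades speed for literal matching.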

UPDATE based on OP's comment

# changed the input a bit so that it contains some of the numbers
# that I am going to generate for the dictionary
dt <- data.table("Name"=c("Loui1s",
                          "Messi",
                          "duple2ssis",
                          "Jegan",
                          "Praveen"))

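# generate numbers as dictionary entries so that they are all different
dict_all <- as.character(1:5000)
# split the dictionary into chunks of 1000 entries each
dict_split <- split(dict_all, ceiling(seq_along(dict_all) / 1000))
# collapse each chunk into its own regex alternation
dict_split_2 <- lapply(dict_split, function(x) paste0(x, collapse = "|"))
# NA where Name matches any of the five chunk patterns, else Name
dt[, processed_Name_2 := ifelse(Name %like% dict_split_2[[1]] | Name %like% dict_split_2[[2]] |
                                  Name %like% dict_split_2[[3]] | Name %like% dict_split_2[[4]] |
                                  Name %like% dict_split_2[[5]], NA, Name)]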
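
The chained `|` conditions above are written out for exactly five chunks. One way to handle any number of chunks is to combine the per-chunk matches with `Reduce`. The sketch below uses base R only; `matches_any_chunk`, `name_vec`, and the default `chunk_size` are assumptions made for illustration.

```r
# Hypothetical helper: TRUE where an element of x matches any dictionary
# entry (treated as a regex), testing `chunk_size` entries per pattern.
matches_any_chunk <- function(x, dict, chunk_size = 1000) {
  chunks <- split(dict, ceiling(seq_along(dict) / chunk_size))
  # one "a|b|c" alternation per chunk
  patterns <- vapply(chunks, paste0, character(1), collapse = "|")
  # OR together one logical vector per chunk, starting from all FALSE
  Reduce(`|`, lapply(patterns, function(p) grepl(p, x)), rep(FALSE, length(x)))
}

name_vec <- c("Loui1s", "Messi", "duple2ssis", "Jegan", "Praveen")
dict_all <- as.character(1:5000)

processed <- ifelse(matches_any_chunk(name_vec, dict_all), NA_character_, name_vec)
print(processed)
```

With the data.table from the answer, the same helper could be used as `dt[, processed_Name_2 := ifelse(matches_any_chunk(Name, dict_all), NA_character_, Name)]`.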
quant