I need to search for a list of words (names) in another list of phrases. I'm currently using str_detect.
In the example below (50 names and 53 catalog rows), exactly one name is found only in the first and last catalog rows, so only those rows get a name:
rm(list = ls())
library(stringr)
library(stringdist)
library(data.table)
list <- data.table(listofnames = c("Hedy", "Eloise", "Lakeshia", "Coleen", "Tawny", "Yolando", "Alida", "Jin", "Brigida", "Wendell", "Elissa", "Evangeline", "Madison", "Napoleon", "Norah", "Mariana", "Ella", "Marissa", "Jan", "Anya", "Eleanor", "Roderick", "Gillian", "Carla", "Melva", "Tommie", "Eliana", "Cristal", "Hui", "Alycia", "Vonnie", "Lala", "Cleveland", "Barbera", "Rosetta", "Meg", "Divina", "Christy", "Dia", "Edna", "Foster", "Pa", "Tennille", "Renato", "Ethelene", "Annemarie", "Jazmine", "Adela", "Aleida", "Alyse"))
catalog <- data.table(name ="", msg = c("The turn solicits Foster the wasteful metal.","The comfort licenses the river.", "The well-made stone evaluates the noise.","The page indexs the amazing peace.", "The note drafts the gold.","The taste exchanges the deranged thing.", "The snobbish reason compiles the roll.","The structure installs the current.", "The letter broadens the wide winter.","The lackadaisical argument comforts the detail.", "The fear nurses the learned fiction.","The heat convinces the luxuriant soup.", "The long-term edge tends the competition.","The puzzled stretch formulates the glass.", "The disease interprets the utter morning.","The abashed country gauges the size.", "The steam adapts the mountainous burst.","The tacit color derives the prose.", "The way exchanges the slim cough.","The moldy force ranks the room.", "The river discovers the expert.","The devilish experience converts the development.", "The lewd weather directs the friend.","The thought furnishs the half stone.", "The tart degree minimizes the doubt.","The deadpan color exercises the protest.", "The point inspires the shock.","The damp expansion acts the ice.", "The overconfident judge dealt withs the secretary.","The food relates the tacit market.", "The doubt troubleshots the scintillating smile.","The ink inventorys the pale invention.", "The kindly competition directs the error.","The feigned doubt writes the sand.", "The kick pilots the expert.","The meal nurses the delightful morning.", "The form traces the seat.","The reward conveys the loss.", "The belief troubleshots the building.","The growth details the mountain.", "The ambiguous kick centralizes the crack.","The system programs the wacky morning.", "The paste rehabilitates the gainful night.","The jumpy silver experiments the driving.", "The silk maximizes the trouble.","The testy doubt qualifys the level.", "The journey revitalizes the military decision.","The cough demonstrates the pleasure.", "The high-pitched debt employs the argument.","The noxious credit chairs the slip.", "The lift Renato monitors Tennille the daughter.","The fight insures the gratis sound.", "The zesty Annemarie credit navigates the mother."))
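# Unique, non-empty names, sorted alphabetically
distnames <- as.character(sort(unique(list$listofnames[list$listofnames != ""])))
for (i in seq_along(catalog$msg)) {
  # Logical vector: TRUE where a name occurs in message i
  names <- str_detect(catalog$msg[i], distnames)
  # Store the name only when exactly one name matches
  if (sum(names) == 1) {
    catalog$name[i] <- distnames[which(names)]
  }
}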
The problem is that this is too slow compared with grep, but I cannot loop over the names with a foreach because there are many more names than messages (msg). I would also have to handle the case where a name has already been filled in and a second one is found: the name must then be deleted, because I do not want to save anything when a message contains two or more names from my database (that is what the if in the code does).
I do not know if there is a function for data.tables like str_detect that returns only a vector with the indices of the TRUE values. I think that would speed things up a bit, because it would not have to return a vector of a million TRUE or FALSE values that I then have to search through.
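To illustrate the kind of result I mean: which() reduces the logical vector to the positions of the TRUE values. This is only a minimal sketch on a shortened name list; the variable hits is made up for the illustration, and wrapping the names in fixed() is an assumption that they should be matched as literal text rather than as regular expressions.

```r
library(stringr)

# Shortened, sorted name list (stand-in for the real distnames)
distnames <- c("Annemarie", "Foster", "Renato", "Tennille")
msg <- "The lift Renato monitors Tennille the daughter."

# fixed() treats each name as literal text instead of a regex
hits <- which(str_detect(msg, fixed(distnames)))

hits             # positions in distnames of the names found in msg
distnames[hits]  # the matching names themselves
```

Note that str_detect still builds the full logical vector internally, so this only changes the shape of the result, not the amount of work.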
This example runs fast, but my real list of names has 7 million rows, so I think building a single pattern with paste is not an option; besides, with one combined pattern I cannot tell which name was found. catalog has 5 million rows.
The names variable is created on every iteration with 50 TRUE or FALSE values. I was looking for something faster that returns only the indices of the TRUE matches, for example a vector with the value 34, indicating that distnames[34] is in catalog$msg[i].
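One direction that might avoid testing every name against every message is to split the messages into words and look the words up in the name list. The sketch below is not the str_detect approach above: it assumes that names always appear as whole words separated by spaces or punctuation, so it would not find a name embedded inside a longer word. The tables are tiny stand-ins, and tokens, words, hits and single are hypothetical names.

```r
library(data.table)

# Tiny stand-ins for the real name list and catalog
distnames <- c("Annemarie", "Foster", "Renato", "Tennille")
catalog <- data.table(name = "", msg = c(
  "The turn solicits Foster the wasteful metal.",
  "The comfort licenses the river.",
  "The lift Renato monitors Tennille the daughter.",
  "The zesty Annemarie credit navigates the mother."))

# Split every message into words; any non-letter run is a separator
tokens <- strsplit(catalog$msg, "[^[:alpha:]]+")

# Long table: one row per (message id, word)
words <- data.table(id = rep(seq_along(tokens), lengths(tokens)),
                    word = unlist(tokens))

# Keep only words that are names; unique() so a repeated name counts once
hits <- unique(words[word %chin% distnames])

# Messages that mention exactly one distinct name
single <- hits[, .(word = word[1L], n = .N), by = id][n == 1L]

# Fill in the name by reference for those messages only
catalog[single$id, name := single$word]
catalog$name
```

Here the third message keeps an empty name because it contains two names, which mirrors the if in the loop. %chin% is data.table's character version of %in%, so each word costs one lookup rather than one comparison per name.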