Check whether any of a group of 20 words exists in each line of a 1-million-line CSV file in R

Question

I'm trying to search a CSV file with 1 million lines of text for 20 to 30 words in R.

I have saved the words as keys and assigned a value to each word. I want to find out whether each line contains any of these words, then create a column that accumulates the score of the matched words.

word <- c("U.S. Capital", "Biden", "Congress", "Marines", "Senate", "Santa")

value <- c(-0.5, -0.6, -0.4, -0.2, -0.4, -0.03)

Are you trying to operate on the raw file, or have you read it into R? Can you give a sample of what the file (and/or imported data) looks like (only a dozen or so rows needed) and the expected output of that sample data? (It would help to vary your sample data so that you have some with matches and some without.) — r2evans, Oct 24 '21 at 23:56

dnlbrky · Answer 1 · 2021-10-25T00:49:34.180

Welcome to StackOverflow! If you add more specifics I can refine my answer, but here is something to get you started.

library(data.table)

## Load your csv file
#search_in <- fread("path/to/file.csv")

## In lieu of a csv, create a table of example text values to search within
search_in <- data.table(text=c(
  "Visit the U.S. Capital and see Congress in action",
  "Santa Clause is (a) real (movie)",
  "The Marines were founded in 1775",
  "What does the fox say?",
  "The United States Senate is the upper chamber of the United States Congress"))

## Create a table of your search terms and the corresponding values
search_for <- data.table(
  word=c("U.S. Capital", "Biden", "Congress", "Marines", "Senate", "Santa"),
  value=c(-0.5, -0.6, -0.4, -0.2, -0.4, -0.03))

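## Cross-join every text line with every search term via a dummy id column,
## flag the pairs where the term occurs in the text, then per text line
## collect the matched words and sum their values
search_res <- merge(search_in[, id:=1L], search_for[, id:=1L], by="id", allow.cartesian=TRUE)[, 
  match:=text %like% word, by=.(text, word, value)][
    match==TRUE, .(words=paste(sort(word), collapse=", "), value=sum(value)), by=text]

## Left-join back onto the original lines so unmatched lines are kept (as NA)
search_res <- merge(search_in[, -"id"], search_res, by="text", all.x=TRUE)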
search_res

##                                                                          text                  words value
##1:                           Visit the U.S. Capital and see Congress in action Congress, U.S. Capital -0.90
##2:                                            Santa Clause is (a) real (movie)                  Santa -0.03
##3:                                            The Marines were founded in 1775                Marines -0.20
##4: The United States Senate is the upper chamber of the United States Congress       Congress, Senate -0.80
##5:                                                      What does the fox say?                   <NA>    NA

The first statement, which creates search_res, cross-joins every row of search_in with every row of search_for (using the dummy id column), adds a column indicating whether the search term is matched in the text column, keeps only the rows that match, and sums the values for each text line.

The line after that joins the original search_in to the results, so you can see text lines that do not have a keyword match.
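One caveat about the matching step: %like% treats each search term as a regular expression, so the dots in "U.S. Capital" match any character. Here is a minimal base R sketch of the difference, assuming you want literal matches (the sample strings are made up for illustration):

```r
# Hypothetical sample strings, for illustration only
x <- c("Visit the U.S. Capital", "Visit the UXSX Capital")

# Regex matching: "." matches any character, so both strings match
regex_hits <- grepl("U.S. Capital", x)

# Literal matching: only the exact substring "U.S. Capital" matches
fixed_hits <- grepl("U.S. Capital", x, fixed = TRUE)

print(regex_hits)
print(fixed_hits)
```

If your real keywords contain characters such as ".", "(", or "+", literal matching is probably what you want.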

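Depending on the size of your data this may be sufficient. Keep in mind that the cross join creates one row per line-and-keyword pair, so 1 million lines and 20 to 30 keywords means 20 to 30 million intermediate rows. If you're using Linux or macOS, you might also investigate grep or a similar command-line solution.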
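If the cross join turns out to be too heavy for 1 million lines, one alternative is to loop over the 20 to 30 keywords and accumulate the score in a plain vector. This is only a sketch and not a benchmarked solution. It assumes literal (non-regex) matching is what you want, and the names score and n_hits are my own:

```r
library(data.table)

# Example data mirroring the tables above; swap in fread("path/to/file.csv")
search_in <- data.table(text = c(
  "Visit the U.S. Capital and see Congress in action",
  "Santa Clause is (a) real (movie)",
  "The Marines were founded in 1775",
  "What does the fox say?",
  "The United States Senate is the upper chamber of the United States Congress"))

search_for <- data.table(
  word  = c("U.S. Capital", "Biden", "Congress", "Marines", "Senate", "Santa"),
  value = c(-0.5, -0.6, -0.4, -0.2, -0.4, -0.03))

# Running totals: one score and one hit counter per text line
score  <- numeric(nrow(search_in))
n_hits <- integer(nrow(search_in))

# One vectorised pass over all lines per keyword
for (i in seq_len(nrow(search_for))) {
  hit <- grepl(search_for$word[i], search_in$text, fixed = TRUE)
  score[hit]  <- score[hit] + search_for$value[i]
  n_hits[hit] <- n_hits[hit] + 1L
}

# Lines without any keyword get NA instead of 0
search_in[, value := ifelse(n_hits > 0L, score, NA_real_)]
print(search_in)
```

This keeps the original row order and never builds more than one logical vector of length nrow(search_in) at a time, at the cost of not recording which words matched.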

Thank you for helping. I'll try the solution and get back with you. — Michael71, Oct 27 '21 at 03:55
