0

For a list of strings that exist as rows within a table, I want to identify the frequency of those strings within rows of another data table in R. Simultaneously, I want to sum the values of rows that contain these strings.

For example, my reference table which contains a list of strings would look something like this:

+-----------------------------+
|String                       |
+-----------------------------+
|Dixon                        |
+-----------------------------+
|Nina Kraviz                  |
+-----------------------------+
|DJ Tennis                    |
+-----------------------------+

And my table that I want to analyze would look something like this:

+--------------------------------+
|String                |Score    |
+--------------------------------+
|Nina Kraviz @ Hyde    |100      |
+--------------------------------+
|DJ Tennis?            |200      |
+--------------------------------+
|From Dixon            |100      |
+--------------------------------+
|From Kevin Saunderson |100      |
+--------------------------------+
|Dixon                 |300      |
+--------------------------------+
|Nina Kraviz           |200      |
+--------------------------------+

I want my resulting table to look like this:

+---------------------------------+
|String             |Score        |
+---------------------------------+
|Dixon              |400          |
+---------------------------------+
|Nina Kraviz        |300          |
+---------------------------------+
|DJ Tennis          |200          |
+---------------------------------+

I've tried using n-grams and tokenizing but it's not working in a way that makes it easy, as artists names can contain 1, 2 or 3 words commonly. Any help would be appreciated.

  • 1
    Please share a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) including a small example of your data used. – Martin Gal Jul 02 '20 at 17:35

1 Answers1

1

We can filter the rows of the second data.frame based on partial matching

library(dplyr)
library(stringr)
pat <- str_c("\\b(", str_c(df1$String, collapse="|"), ")\\b")
df2 %>%
     group_by(String = str_extract(String, pat)) %>%
     filter(!is.na(String)) %>%
     summarise(Score = sum(Score, na.rm = TRUE))
# A tibble: 3 x 2
#  String      Score
#  <chr>       <dbl>
#1 Dixon         400
#2 DJ Tennis     200
#3 Nina Kraviz   300

data

df1 <- structure(list(String = c("Dixon", "Nina Kraviz", "DJ Tennis"
)), class = "data.frame", row.names = c(NA, -3L))

df2 <- structure(list(String = c("Nina Kraviz @ Hyde", "DJ Tennis?", 
"From Dixon", "From Kevin Saunderson", "Dixon", "Nina Kraviz"
), Score = c(100, 200, 100, 100, 300, 200)), class = "data.frame", 
row.names = c(NA, 
-6L))
akrun
  • 874,273
  • 37
  • 540
  • 662