1

I am looking for a solution in R to find how many time a word from a column/columns is present in another column of a data frame.

I have a DF with 4 columns (page, text, wildanimals and animals).

df <- tibble::tibble(page=c(12,6,9,18,2),
                 text=c("Dogs are related to wolves, but dogs are friendly",
                        "I love pets",
                        "I like goat and deer. Deer and goat",
                        "Zebra have stripes on their body",
                        "Lizards are Crocodiles have tails"
                        ))
wildanimals <- c("wolves", "tiger", "deer", "zebra", "crocodile")
animals <- c("dogs", "cats", "goat", "horse", "lizard")

cbind(df,animals, wildanimals)

I want to check if the words in column animals and wildanimals are present in column text and how many times. Something like this:

frequency <- c("3","0","4", "1", "2")
cbind(df,animals, wildanimals,frequency)

I have asked similar question here: Link to the Question, but it only tells if the word is present or not.

Fraxxx
  • 114
  • 1
  • 11

2 Answers2

2

We could use str_count to count the words that are in 'animals', 'wildanimals' can add (+) them together

library(dplyr)
library(stringr)
df1 <- df1 %>% 
   mutate(frequency = str_count(text, coll(animals, ignore_case = TRUE)) + 
                      str_count(text, coll(wildanimals, ignore_case = TRUE)))
df1
#  page                                              text animals wildanimals frequency
#1   12 Dogs are related to wolves, but dogs are friendly    dogs      wolves         3
#2    6                                       I love pets    cats       tiger         0
#3    9               I like goat and deer. Deer and goat    goat        deer         4
#4   18                  Zebra have stripes on their body   horse       zebra         1
#5    2                 Lizards are Crocodiles have tails  lizard   crocodile         2

Or in a compact way by pasteing the 'animals', 'wildanimals' column into a single one with sep as | and wrap with regex instead of coll in str_count

df1 <- df1 %>%
   mutate(frequency = str_count(text, 
      regex(str_c(animals, wildanimals, sep="|"), ignore_case = TRUE)))
df1
#   page                                              text animals wildanimals frequency
#1   12 Dogs are related to wolves, but dogs are friendly    dogs      wolves         3
#2    6                                       I love pets    cats       tiger         0
#3    9               I like goat and deer. Deer and goat    goat        deer         4
#4   18                  Zebra have stripes on their body   horse       zebra         1
#5    2                 Lizards are Crocodiles have tails  lizard   crocodile         2

NOTE: %>% doesn't create the column in place, so we need to reassign (<-) it to the original object. If we want to create the column in place in the original object 'df1', use the magrittr operator (%<>%)

akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Could also use `glue::glue()`: `df %>% mutate(frequency = str_count(tolower(text), glue("{animals}|{wildanimals}")))` – andrew_reece Sep 20 '20 at 18:42
  • @andrew_reece Looks good to me. Could you also post that as a solution. Thanks – akrun Sep 20 '20 at 18:48
  • 1
    Sure - it seemed close enough to you that I thought maybe a comment was more appropriate. But I'll add as a separate solution now. – andrew_reece Sep 20 '20 at 18:53
  • @akrun Thanks for the solution. Can you also please provide a way how I can save it as a data frame. I don't have the "frequency" column in my data frame "df". – Fraxxx Sep 20 '20 at 18:57
  • @Fraxxx If you want column only as a separate data.frame. just change the `mutate` to `transmute`i.e. `df2 <- df1 %>% transmute(frequency = str_count(text, regex(str_c(animals, wildanimals, sep="|"), ignore_case = TRUE)))` – akrun Sep 20 '20 at 18:58
  • @akrun No, When I run your solution, the data frame "df" doesn't have the frequency column. I need the data frame with all columns including frequency. – Fraxxx Sep 20 '20 at 19:07
  • @Fraxxx Just do `df1 <- df1 %>% mutate(frequency = str_count(text, coll(animals, ignore_case = TRUE)) + str_count(text, coll(wildanimals, ignore_case = TRUE)))` – akrun Sep 20 '20 at 19:07
2

Another way to use str_count is by merging animals and wildanimals with glue::glue(). Then match against the lowercase version of text:

library(tidyverse)

df %>% mutate(frequency = str_count(tolower(text), glue("{animals}|{wildanimals}")))
andrew_reece
  • 20,390
  • 3
  • 33
  • 58