Find specific strings, count their frequency in a given text, and report it as a proportion of the number of words

Question

I'm trying to write a function in R that would:

1) look through each observation's string variables

2) identify and count certain strings that the user defines

3) report the findings as a proportion of the total number of words each observation contains.

Here's a sample dataset:

df <- data.frame(essay1=c("OMG. american sign language. knee-slides in leather pants", "my face looks totally different every time. lol."),  
             essay2=c("cheez-its and dried cranberries. sparkling apple juice is pretty\ndamned coooooool too.<br />\nas for music, movies and books: the great american authors, mostly\nfrom the canon, fitzgerald, vonnegut, hemmingway, hawthorne, etc.\nthen of course the europeans, dostoyevski, joyce, the romantics,\netc. also, one of the best books i have read is all quiet on the\nwestern front. OMG. I really love that. lol", "at i should have for dinner\nand when; some random math puzzle, which I loooooove; what it means to be alive; if\nthe meaning of life exists in the first place; how the !@#$ can the\npolitical mess be fixed; how the %^&amp;* can the education system\nbe fixed; my current game design project; my current writing Lol"),  
             essay3=c("Lol. I enjoy life and then no so sure what else to say", "how about no?"))

The furthest I have managed to get is this function:

find.query <- function(char.vector, query){
  which.has.query <- grep(query, char.vector, ignore.case = TRUE)
  length(which.has.query) != 0
}
profile.has.query <- function(data.frame, query){
  query <- tolower(query)
  has.query <- apply(data.frame, 1, find.query, query=query)
  return(has.query)
}

This lets the user detect whether a given value appears in the essays for a given user, but that isn't enough for the three goals outlined above. Ideally, the function would count the number of matches, then divide that count by the total number of words across all of that user's essays (the row sum of word counts for each user).

Any advice on how to approach this?

Please check your input data.frame, you probably didn't escape some quote because it gives errors... — digEmAll, Mar 20 '16 at 18:37

score 0 · Accepted Answer · edited May 23 '17 at 11:45

Using the stringi package, as in this post: "How do I count the number of words in a text (string) in R?"

library(stringi)

words.identified.over.total.words <- function(dataframe, query){
  # lower-case both the query and the text so that matching ignores case
  query <- tolower(query)
  dataframe <- tolower(as.matrix(dataframe))

  # count the total number of words
  total.words <- apply(dataframe, 2, stri_count, regex = "\\S+")

  # count the number of words matching query
  number.query <- apply(dataframe, 2, stri_count, regex = query)

  # divide the number of words identified by total words for each column
  final.result <- colSums(number.query) / colSums(total.words)

  return(final.result)
}

(The df in your question has each essay in a column, so the function sums each column. However, in the text of your question you say you want row sums. If the input data frame was meant to have one essay per row, then you can change the function to reflect that.)
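The two `stri_count` calls in the function can be checked on a small vector. The sketch below uses a made-up string `x` (not from the question's data) and shows two things worth knowing: a regex passed to `stri_count` is case-sensitive by default, and it counts matches anywhere in the text, including inside longer words. The whole-word, case-insensitive variant at the end is only an assumption about what might be wanted.

```r
library(stringi)

# made-up example text, for illustration only
x <- c("Lol. lollipop lol LOL", "how about no?")

# whitespace-delimited tokens, the same pattern the function uses
n.words <- stri_count(x, regex = "\\S+")

# plain pattern: case-sensitive, and it also matches inside "lollipop"
n.plain <- stri_count(x, regex = "lol")

# whole-word, case-insensitive variant (an assumption about the intent)
n.whole <- stri_count(x, regex = "\\blol\\b",
                      opts_regex = stri_opts_regex(case_insensitive = TRUE))

n.plain / n.words
n.whole / n.words
```

For the first string, the plain pattern finds 2 matches in 4 tokens (one of them inside "lollipop"), while the whole-word variant finds 3 ("Lol", "lol", "LOL").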

This is brilliant! And it works quite well. To be clear, to change the function to work on rows, we would change lines 3 and 4 to `total.words <- apply(dataframe, 1, stri_count, regex = "\\S+")` and `number.query <- apply(dataframe, 1, stri_count, regex = query)`. Then we would change the fifth line to `final.result <- rowsum(number.query) / rowsum(total.words)`. — Cauchy, Mar 23 '16 at 15:56
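One caveat on the row-wise change suggested in the comment above, offered as a sketch rather than a definitive fix. Base R's `rowsum()` sums rows by a grouping variable and needs a `group` argument, so it is not a drop-in replacement for `colSums()`. Also, when `apply()` is called with margin 1 and the function returns a vector, the results come back as columns, so each column of the output corresponds to one observation and `colSums()` still gives the per-observation totals. The toy data frame and the names `words.by.row`, `query.by.row` and `proportion` are made up for illustration.

```r
library(stringi)

# hypothetical toy data: 2 observations (rows) x 3 essays (columns)
toy <- data.frame(essay1 = c("lol one two", "three"),
                  essay2 = c("four lol", "five six lol lol"),
                  essay3 = c("seven", "eight nine"),
                  stringsAsFactors = FALSE)

# margin 1 hands each row to stri_count; the result has one COLUMN per row
words.by.row <- apply(toy, 1, stri_count, regex = "\\S+")
query.by.row <- apply(toy, 1, stri_count, regex = "lol")

dim(words.by.row)  # 3 essays by 2 observations

# per-observation proportion: add up the essays within each column
proportion <- colSums(query.by.row) / colSums(words.by.row)
proportion
```

Here the first observation has 2 matches in 6 words and the second has 2 matches in 7 words, so `proportion` holds 2/6 and 2/7.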
