I'm trying to write a function in R that will:
1) look through each observation's string variables
2) identify and count certain strings that the user defines
3) report the findings as a proportion of the total number of words each observation contains.
Here's a sample dataset:
df <- data.frame(essay1=c("OMG. american sign language. knee-slides in leather pants", "my face looks totally different every time. lol."),
essay2=c("cheez-its and dried cranberries. sparkling apple juice is pretty\ndamned coooooool too.<br />\nas for music, movies and books: the great american authors, mostly\nfrom the canon, fitzgerald, vonnegut, hemmingway, hawthorne, etc.\nthen of course the europeans, dostoyevski, joyce, the romantics,\netc. also, one of the best books i have read is all quiet on the\nwestern front. OMG. I really love that. lol", "at i should have for dinner\nand when; some random math puzzle, which I loooooove; what it means to be alive; if\nthe meaning of life exists in the first place; how the !@#$ can the\npolitical mess be fixed; how the %^&* can the education system\nbe fixed; my current game design project; my current writing Lol"),
essay3=c("Lol. I enjoy life and then no so sure what else to say", "how about no?"),
                 stringsAsFactors = FALSE)  # keep strings as characters (matters before R 4.0)
The furthest I've managed to get is this pair of functions:
find.query <- function(char.vector, query){
  # indices of the elements that contain the query (case-insensitive)
  which.has.query <- grep(query, char.vector, ignore.case = TRUE)
  length(which.has.query) != 0  # TRUE if at least one element matches
}
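For reference, this is how `find.query` behaves on a small character vector (the function is repeated here so the snippet runs on its own):

```r
find.query <- function(char.vector, query){
  which.has.query <- grep(query, char.vector, ignore.case = TRUE)
  length(which.has.query) != 0
}

ex <- c("OMG. american sign language", "nothing to see here")
find.query(ex, "omg")  # TRUE: grep matches the first element
find.query(ex, "xyz")  # FALSE: no element matches
```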
profile.has.query <- function(data.frame, query){
  query <- tolower(query)  # redundant given ignore.case = TRUE, but harmless
  # apply row-wise: does any essay in the row contain the query?
  has.query <- apply(data.frame, 1, find.query, query = query)
  return(has.query)
}
This lets the user detect whether a given query appears anywhere in the essays for a given user, but that's not enough for the three goals outlined above. Ideally the function would count the number of matching words, then divide that count by the total number of words across all of that user's essays (the row-wise word count).
Any advice on how to approach this?
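One possible sketch of what I'm after, building on the functions above: split each row's essays into words, count how many words contain the query, and divide by the total word count. Splitting on whitespace is an assumption here (punctuation stays attached to words, so a substring match via `grepl(..., fixed = TRUE)` is used so that e.g. "lol." still counts as a hit for "lol"):

```r
count.query <- function(char.vector, query){
  # collapse all of the row's essays into one string, lower-case it,
  # and split on runs of whitespace (assumption: whitespace defines words)
  words <- unlist(strsplit(tolower(paste(char.vector, collapse = " ")), "\\s+"))
  # substring match so punctuation-attached forms like "lol." still count
  hits <- sum(grepl(tolower(query), words, fixed = TRUE))
  hits / length(words)  # proportion of the row's words that match
}

query.proportion <- function(data.frame, query){
  # one proportion per row (per user)
  apply(data.frame, 1, count.query, query = query)
}
```

For example, `query.proportion(df, "lol")` would return one value per row of `df`, each the fraction of that user's words containing "lol". This is only a starting point; a stricter definition of "word" (stripping punctuation first with `gsub("[[:punct:]]", "", ...)`) may fit better depending on the data.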