I am building a function that takes a dataset name and a variable name as separate inputs, then uses the variable name as a label in an output dataset. A very simple mockup of the task is below.
This is for a text mining project, so I have hypothetical documents numbered 1-10, and keywords that could show up any number of times in each document, stored as separate columns in the dataset. I'm interested in cross-tabbing how often each word appears given a certain category of interest and summarizing that information in a simple report.
...
#fake data
document <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
jack <- c(1, 0, 3, 2, 0, 1, 0, 0, 2, 1)
jill <- c(0, 0, 0, 1, 2, 1, 1, 2, 3, 0)
hill <- c(2, 1, 0, 2, 1, 0, 4, 2, 2, 0)
water <- c(4, 3, 5, 0, 1, 0, 0, 1, 2, 1)
outcome <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1)
text.data <- data.frame(document, jack, jill, hill, water, outcome)
library(tibble)  # provides tibble()
get_freq_info1 = function(token_name, this_token, outcome){
  # cross-tab "token appears at least once" against the outcome flag
  tbl_to_test = table(this_token > 0, outcome)
  return(tibble(token = token_name,
                n_with_token = sum(tbl_to_test[2, 1:2]),
                n_with_token_and_flag = (tbl_to_test[2,2]),
                percent_with_token_and_flag = (tbl_to_test[2,2])/sum(tbl_to_test))
  )
}
#check
get_freq_info1(token_name = "jack",
               this_token = text.data$jack,
               outcome = text.data$outcome == 1)
As you can see, it doesn't make sense that I need to input both "jack" and text.data$jack. That seems repetitive. I will need to use this function to map over a long list of variables in a much bigger dataset, and I'd like to streamline this as much as possible.
Where I'm struggling is how to get R to mash the dataset name and the variable name together to appropriately run the query, while still being able to use the input variable name as a character value for labeling purposes. Ideally, my function would look something like this:
get_freq_info2 = function(ds, this_token, outcome){
  # not working: paste() only builds the string "text.data$jack", not the column itself
  ds_token = paste(ds,"$",this_token, sep="")
  ds_outcome = paste(ds,"$",outcome, sep="")
  tbl_to_test = table(ds_token>0, ds_outcome)
  return(tibble(token = this_token,
                n_with_token = sum(tbl_to_test[2, 1:2]),
                n_with_token_and_flag = (tbl_to_test[2,2]),
                percent_with_token_and_flag = (tbl_to_test[2,2])/sum(tbl_to_test))
  )
}
#check
get_freq_info2(ds = "text.data", this_token = "jack", outcome = "outcome")
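For illustration, here is a minimal sketch of one way the interface above might be made to work. It assumes the data frame itself can be passed in place of its name, and it looks columns up by string with `[[ ]]` so the same string can double as the label. The name `get_freq_info3` and the `flag_value` argument are hypothetical, and base `data.frame()` stands in for `tibble()` only to keep the sketch free of dependencies.

```r
# Fake data, repeated from the question so the sketch runs on its own
text.data <- data.frame(
  document = 1:10,
  jack     = c(1, 0, 3, 2, 0, 1, 0, 0, 2, 1),
  jill     = c(0, 0, 0, 1, 2, 1, 1, 2, 3, 0),
  hill     = c(2, 1, 0, 2, 1, 0, 4, 2, 2, 0),
  water    = c(4, 3, 5, 0, 1, 0, 0, 1, 2, 1),
  outcome  = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1)
)

# Hypothetical helper (not from the original post):
#   ds         - the data frame itself
#   this_token - column name of the keyword, as a string
#   outcome    - column name of the category, as a string
#   flag_value - assumed value of the category that counts as "flagged"
get_freq_info3 <- function(ds, this_token, outcome, flag_value = 1) {
  # ds[[name]] extracts a column by its string name, so the string can
  # also be reused as the label in the output
  has_token <- factor(ds[[this_token]] > 0, levels = c(FALSE, TRUE))
  has_flag  <- factor(ds[[outcome]] == flag_value, levels = c(FALSE, TRUE))

  # Fixed factor levels keep the table 2 x 2 even if a token never appears
  tbl_to_test <- table(has_token, has_flag)

  data.frame(
    token = this_token,
    n_with_token = sum(tbl_to_test["TRUE", ]),
    n_with_token_and_flag = tbl_to_test["TRUE", "TRUE"],
    percent_with_token_and_flag = tbl_to_test["TRUE", "TRUE"] / sum(tbl_to_test)
  )
}

# Map over several token columns and stack the one-row results
tokens <- c("jack", "jill", "hill", "water")
freq_report <- do.call(
  rbind,
  lapply(tokens, function(tok) get_freq_info3(text.data, tok, "outcome"))
)
print(freq_report)
```

If the dataset really does have to arrive as a string, `get("text.data")` would look the object up by name before the column extraction. Passing the data frame directly is usually simpler to reason about.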