
I am building a function which will require taking a dataset name and a variable name as separate inputs, then using the variable name as a label in an output dataset. Please see a very simple mockup of this task below.

This is for a text mining project, so I have hypothetical documents numbered 1-10, and keywords that could show up any number of times in each document as different columns in the dataset. I'm interested in cross-tabbing how often each word appears given a certain category of interest and summarizing that information in a simple report.

...
    
library(tibble)  # provides tibble(), used in the functions below

# fake data
document <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
jack     <- c(1, 0, 3, 2, 0, 1, 0, 0, 2, 1)
jill     <- c(0, 0, 0, 1, 2, 1, 1, 2, 3, 0)
hill     <- c(2, 1, 0, 2, 1, 0, 4, 2, 2, 0)
water    <- c(4, 3, 5, 0, 1, 0, 0, 1, 2, 1)
outcome  <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1)

text.data <- data.frame(document, jack, jill, hill, water, outcome)


get_freq_info1 = function(token_name, this_token, outcome){
    # 2x2 table: rows = token absent/present, columns = outcome FALSE/TRUE
    tbl_to_test = table(this_token > 0, outcome)
    return(tibble(token = token_name,
                  n_with_token = sum(tbl_to_test[2, 1:2]),
                  n_with_token_and_flag = tbl_to_test[2, 2],
                  percent_with_token_and_flag = tbl_to_test[2, 2] / sum(tbl_to_test))
    )
}

#check
get_freq_info1(token_name = "jack",
               this_token = text.data$jack,
               outcome = text.data$outcome == 1)

As you can see, it doesn't make sense that I need to input both "jack" and text.data$jack. That seems repetitive. I will need to use this function to map over a long list of variables in a much bigger dataset, and I'd like to streamline this as much as possible.

Where I'm struggling is how to get R to mash the dataset name and the variable name together to appropriately run the query, while still being able to use the input variable name as a character value for labeling purposes. Ideally, my function would look something like this:

get_freq_info2 = function(ds, this_token, outcome){
    ds_token = paste(ds,"$",this_token, sep="")
    ds_outcome = paste(ds,"$",outcome, sep="")
    tbl_to_test = table(ds_token>0, ds_outcome) 
    return(tibble(token = this_token,
                n_with_token = sum(tbl_to_test[2, 1:2]),
                n_with_token_and_flag = (tbl_to_test[2,2]),
                percent_with_token_and_flag = (tbl_to_test[2,2])/sum(tbl_to_test))
    )
    
}
#check
get_freq_info2(ds = "text.data", this_token= "jack", outcome= outcome==1)
  • No need to mess with the `paste` part. You should avoid `$` in this case since the column name is a character string. In this case use `[[ ]]`. For example `ds_token = ds[[this_token]]` and `ds_outcome = ds[[outcome]]` – MrFlick Dec 18 '20 at 04:04
  • Thank you so much for the response, this is incredibly helpful! – Julie Kafka Dec 18 '20 at 22:52
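
Following MrFlick's comment, here is a minimal sketch of how `get_freq_info2` might look with `[[ ]]` indexing. It makes a few assumptions that are not in the original post: `ds` is the data frame itself rather than its name as a string, `outcome` is the outcome column's name as a string, and `flag_value` is a hypothetical extra argument for the category of interest. It also counts with logical sums instead of `table()`, which should give the same numbers when there are no missing values and avoids the `[2, 2]` lookup failing if a token appears in every document.

```r
library(tibble)

# Same mock data as in the question
text.data <- data.frame(
  document = 1:10,
  jack     = c(1, 0, 3, 2, 0, 1, 0, 0, 2, 1),
  jill     = c(0, 0, 0, 1, 2, 1, 1, 2, 3, 0),
  hill     = c(2, 1, 0, 2, 1, 0, 4, 2, 2, 0),
  water    = c(4, 3, 5, 0, 1, 0, 0, 1, 2, 1),
  outcome  = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1)
)

# ds: a data frame; this_token and outcome: column names as strings.
# flag_value is a hypothetical argument (not in the original question).
get_freq_info2 <- function(ds, this_token, outcome, flag_value = 1) {
  has_token <- ds[[this_token]] > 0          # look up the column by its name
  has_flag  <- ds[[outcome]] == flag_value
  tibble(
    token = this_token,                      # the same string doubles as the label
    n_with_token = sum(has_token),
    n_with_token_and_flag = sum(has_token & has_flag),
    percent_with_token_and_flag = sum(has_token & has_flag) / nrow(ds)
  )
}

# check a single token
jack_info <- get_freq_info2(ds = text.data, this_token = "jack", outcome = "outcome")

# map over several token columns and stack the one-row results
tokens <- c("jack", "jill", "hill", "water")
all_tokens <- do.call(
  rbind,
  lapply(tokens, get_freq_info2, ds = text.data, outcome = "outcome")
)
print(all_tokens)
```

Because the token is passed once as a string, it serves both as the column lookup and as the label, so the same call can be mapped over a long vector of column names. If only the dataset's name is available as a string, `get("text.data")` could retrieve the object first, though passing the data frame directly is usually simpler.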
