I have a number of dataframes that I want to run some tokenising on and save as CSVs. I am attempting to wrap the code I have been working with in a function and write each CSV out under the name of the dataframe being processed.
For this example I have provided a dataframe called subzibo2. When I run the function, though, I get errors at the write.csv stage. I have tried building the filename with both paste() and sprintf(), but neither works.
For the paste() option I get:
Error in file(file, ifelse(append, "a", "w")) : invalid 'description' argument
In addition: Warning message:
In if (file == "") file <- stdout() else if (is.character(file)) { :
  the condition has length > 1 and only the first element will be used
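If it helps, I can reproduce this outside the function; I suspect it is because paste() is being handed the whole dataframe rather than a name:
paste("subs/", subzibo2, ".csv", sep = "")
# gives a character vector of length 2 (one deparsed string per column of
# subzibo2), not a single filename, which write.csv then rejects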
For the sprintf() option I get:
Error in sprintf("subs/.%d.csv", prodsplit) : unsupported type
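Again this reproduces outside the function when the dataframe itself is the argument:
sprintf("subs/.%d.csv", subzibo2)
# Error in sprintf("subs/.%d.csv", subzibo2) : unsupported type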
Can someone help please? What am I doing wrong? I have left the alternative write.csv call in the code as a comment for ease of comparison. The steps within the function all work when I run them outside a function. For info, they take a dataframe, tokenise the ProdNameReduced column, and return a dataframe of all the token phrase options (whole phrases or parts thereof), together with the number of words in each phrase and its number of occurrences within the subzibo2 dataframe.
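To be clear about the intent, for this example the final line should end up equivalent to:
write.csv(tdm_matrix_rowsums_df, file = "subs/subzibo2.csv")
with the filename derived from whichever dataframe is passed in.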
library(tm)
library(RWeka)
library(plyr)
library(dplyr)
subzibo2 = data.frame(ProdNameReduced = c("zibo muffin fold over x 100", "zibo muffin fold over x 1", "zibo sandwich 250s x 1", "zibo sandwich x 1s", "zibo 500g clamshell punnet x 1", "zibo burger fold over x 300", "zibo burger fold over x 1", "zibo 500g clamshell punnet x 500s", "zibo 1kg clamshell punnet x 500s", "zibo 1kg clamshell punnet x 1", "zibo 4 cavity fruit tray x 1","zibo 4 cavity fruit tray x 500", "zibo 2 cavity fruit tray x 1", "zibo 2 cavity fruit tray x 1000"), Code = c("ZIBOZFO6BOX", "ZIBOZFO6", "ZIBOSANDWICH", "ZIBOS/WICHSINGL", "ZIBOCS85", "ZIBOBURGERBOX","ZIBOBURGER", "ZIBOBOX500G", "ZIBOBOX1KG", "ZIBO781KG", "ZIBO4LOOSE", "ZIBO4", "ZIBO2LOOSE", "ZIBO2"))
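As a standalone illustration of the tokenising step, with max set to this name's word count (token order may differ):
RWeka::NGramTokenizer("zibo sandwich x 1s", RWeka::Weka_control(min = 1, max = 4))
# returns every 1- to 4-gram, e.g. "zibo sandwich x 1s", "zibo sandwich x",
# "sandwich x 1s", "zibo sandwich", ..., "zibo", "sandwich", "x", "1s"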
ProdType = function(prodsplit)
{
# ensure the product names are character, not factor
prodsplit$ProdNameReduced = as.character(prodsplit$ProdNameReduced)
# the longest name, in words, caps the n-gram length
max_ngram = max(sapply(strsplit(prodsplit$ProdNameReduced, " "), length))
# despite the name, this tokenises into every n-gram from 1 to max_ngram words
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = max_ngram))}
prodsplit_corpus = Corpus(VectorSource(prodsplit$ProdNameReduced))
tdm <- TermDocumentMatrix(prodsplit_corpus, control = list(tokenize = BigramTokenizer))
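# tdm: one row per distinct phrase (1 to max_ngram words), one column per
# product name, each cell counting how often that phrase occurs in that name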
rm(prodsplit_corpus)
tdm_matrix = as.matrix(tdm)
rm(tdm)
# total occurrences of each phrase across all names, most frequent first
tdm_matrix_rowsums = sort(rowSums(tdm_matrix), decreasing = T)
rm(tdm_matrix)
tdm_matrix_rowsums_df = as.data.frame(tdm_matrix_rowsums)
rm(tdm_matrix_rowsums)
tdm_matrix_rowsums_df$phrases = row.names(tdm_matrix_rowsums_df)
rownames(tdm_matrix_rowsums_df) = NULL
# word count per phrase (splitting on \\S+ yields one element per word)
tdm_matrix_rowsums_df$phrasecount = vapply(strsplit(tdm_matrix_rowsums_df$phrases, "\\S+"), length, integer(1))
colnames(tdm_matrix_rowsums_df) = c("occurence","phrases", "phrasecount")
# collapse any duplicate phrases, summing their counts
tdm_matrix_rowsums_df = ddply(tdm_matrix_rowsums_df, .(phrases), colwise(sum))
tdm_matrix_rowsums_df = arrange(tdm_matrix_rowsums_df, phrases, occurence)
tdm_matrix_rowsums_df = select(tdm_matrix_rowsums_df, phrasecount, occurence, phrases)
tdm_matrix_rowsums_df$selector = character(nrow(tdm_matrix_rowsums_df))
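# for reference, the result should have columns phrasecount, occurence,
# phrases, selector; e.g. "zibo" appears once in each of the 14 product
# names, so its row should read phrasecount 1, occurence 14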
#write.csv(tdm_matrix_rowsums_df, file = paste("subs/", prodsplit, ".csv", sep = ""))
write.csv(tdm_matrix_rowsums_df, file = sprintf("subs/.%d.csv", prodsplit))
}
ProdType(subzibo2)