
I have a number of data frames that I want to tokenise and save as CSVs. I am trying to wrap the code I have been working with in a function that writes a CSV named after the working data frame.

For this example I have presented a data frame called subzibo2. When I run the function, though, I get errors at the write.csv stage. I have tried building the filename with both paste and sprintf, but neither works.

For the paste() option I get

Error in file(file, ifelse(append, "a", "w")) :
  invalid 'description' argument
In addition: Warning message:
In if (file == "") file <- stdout() else if (is.character(file)) { :
  the condition has length > 1 and only the first element will be used

For the sprintf() option I get

Error in sprintf("subs/.%d.csv", prodsplit) : unsupported type

Can someone help please? What am I doing wrong? I have left the alt write.csv in the code as a comment for ease. The steps within the function all work when I run them without being wrapped in a function. For info, they take a spreadsheet and tokenise the column ProdNameReduced, and return a dataframe with all the various token phrase options (phrases or parts thereof) with the number of words in each phrase and occurrences within the subzibo2 dataframe.

library(tm) 
library(RWeka)
library(plyr)
library(dplyr)

subzibo2 = data.frame(ProdNameReduced = c("zibo muffin fold over x 100", "zibo muffin fold over x 1", "zibo sandwich 250s x 1", "zibo sandwich x 1s", "zibo 500g clamshell punnet x 1",    "zibo burger fold over x 300", "zibo burger fold over x 1", "zibo 500g clamshell punnet x 500s", "zibo 1kg clamshell punnet x 500s", "zibo 1kg clamshell punnet x 1", "zibo 4 cavity fruit tray x 1","zibo 4 cavity fruit tray x 500", "zibo 2 cavity fruit tray x 1", "zibo 2 cavity fruit tray x 1000"), Code = c("ZIBOZFO6BOX", "ZIBOZFO6", "ZIBOSANDWICH", "ZIBOS/WICHSINGL", "ZIBOCS85", "ZIBOBURGERBOX","ZIBOBURGER", "ZIBOBOX500G", "ZIBOBOX1KG", "ZIBO781KG", "ZIBO4LOOSE", "ZIBO4", "ZIBO2LOOSE", "ZIBO2"))

ProdType = function(prodsplit)
{
    prodsplit$ProdNameReduced = as.character(prodsplit$ProdNameReduced)

    max_ngram = max(sapply(strsplit(prodsplit$ProdNameReduced, " "), length))

    BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = max_ngram))}

    prodsplit_corpus = Corpus(VectorSource(prodsplit$ProdNameReduced))
    tdm <- TermDocumentMatrix(prodsplit_corpus, control = list(tokenize = BigramTokenizer))
    rm(prodsplit_corpus)
    tdm_matrix = as.matrix(tdm)
    rm(tdm)
    tdm_matrix_rowsums = sort(rowSums(tdm_matrix), decreasing = T)
    rm(tdm_matrix)
    tdm_matrix_rowsums_df = as.data.frame(tdm_matrix_rowsums)
    rm(tdm_matrix_rowsums)
    tdm_matrix_rowsums_df$phrases = row.names(tdm_matrix_rowsums_df)
    rownames(tdm_matrix_rowsums_df) = NULL
    tdm_matrix_rowsums_df$phrasecount = vapply(strsplit(tdm_matrix_rowsums_df$phrases, "\\S+"), length, integer(1))

    colnames(tdm_matrix_rowsums_df) = c("occurence","phrases", "phrasecount")
    tdm_matrix_rowsums_df = ddply(tdm_matrix_rowsums_df, .(phrases), colwise(sum))
    tdm_matrix_rowsums_df = arrange(tdm_matrix_rowsums_df, phrases, occurence)
    tdm_matrix_rowsums_df = select(tdm_matrix_rowsums_df, phrasecount, occurence, phrases)
    tdm_matrix_rowsums_df$selector = character(nrow(tdm_matrix_rowsums_df))

    #write.csv(tdm_matrix_rowsums_df, file = paste("subs/", prodsplit, ".csv", sep = ""))
    write.csv(tdm_matrix_rowsums_df, file = sprintf("subs/.%d.csv" , prodsplit))

}

ProdType(subzibo2)
CallumH
  • Just as a test, what happens when you set `file` in `write.csv` equal to "test.csv". Does it run as expected? – giraffehere Dec 16 '15 at 15:38
  • Looks like `prodsplit` is the entire data.frame. Both paste and sprintf take a string as an argument. What do you want to name the file? – Aaron left Stack Overflow Dec 16 '15 at 18:58
  • Hi, @Aaron. I want the file to be called subzibo2.csv please. My aim is to pass a variety of dataframes to this function and for the csv output to be named after each dataframe I pass into the function please. – CallumH Dec 16 '15 at 19:02
  • @giraffehere. Hi. Yes, `write.csv(tdm_matrix_rowsums_df, "subs/test.csv")` works. – CallumH Dec 16 '15 at 19:12
  • See [In R, how to get an object's name after it is sent to a function?](http://stackoverflow.com/q/10520772/210673). – Aaron left Stack Overflow Dec 16 '15 at 19:38
  • Try toString, it's worked for me in the past. `write.csv(tdm_matrix_rowsums_df, file = sprintf("subs/.%d.csv", toString(prodsplit)))` – keberwein Dec 16 '15 at 19:53
  • Thanks @Kernel_Panic, but that didn't work. It still throws an error: Error in sprintf("subs/.%d.csv", toString(prodsplit)) : invalid format '%d'; use format %s for character objects – CallumH Dec 16 '15 at 20:16
  • I can't get deparse(substitute(prodsplit)) to work, @Aaron. It just returns gobbledygook with backslashes all over the place. – CallumH Dec 16 '15 at 20:23
  • Hi @Aaron. The answer in the question you point to returns a character string "a", which is the argument in the function. I'm still not getting `write.csv` to work, and when I test with `print()` I get 13 lines of code with no reference to subzibo2. It starts like ... [1] structure(list(ProdNameReduced = c("zibo muffin fold over x 100", [2] "zibo muffin fold over x 1", "zibo sandwich 250s x 1", "zibo sandwich x 1s", – CallumH Dec 16 '15 at 21:01
  • Well, you want it to return "subzibo2", which is the argument to your function, right? – Aaron left Stack Overflow Dec 16 '15 at 22:11
  • But I do see something that could be the problem; you have to `deparse(substitute(...))` before you do anything to the argument. So as the first line in your function, put `nm <- deparse(substitute(prodsplit))` and then use that `nm` when you make the output filename. – Aaron left Stack Overflow Dec 16 '15 at 22:13
  • That did the trick! Many many thanks. – CallumH Dec 16 '15 at 22:32
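
For readers landing here, a minimal sketch of the fix described in the comments, assuming the goal is a CSV named after the variable passed in. The helper name `write_named_csv` and its `dir` argument are hypothetical, and the tokenising steps are left out so the sketch runs without tm or RWeka. The key point is that `deparse(substitute(prodsplit))` has to run before the function assigns to `prodsplit`; once the argument has been modified, `substitute()` returns the data itself rather than the name, which would explain the `structure(list(...` output reported earlier. The format string also uses `%s` rather than `%d`, because the name is a character string.

```r
# Hypothetical helper illustrating the approach from the comments;
# it is not the original ProdType() function.
write_named_csv <- function(prodsplit, dir = tempdir()) {
  # Capture the caller's variable name FIRST, before prodsplit is modified.
  nm <- deparse(substitute(prodsplit))

  # Any later change to the argument is now safe.
  prodsplit$ProdNameReduced <- as.character(prodsplit$ProdNameReduced)

  # Build the path from the captured name; %s is the string placeholder.
  out <- file.path(dir, sprintf("%s.csv", nm))
  write.csv(prodsplit, file = out, row.names = FALSE)

  # Return the path invisibly so callers can check where the file went.
  invisible(out)
}

subzibo2 <- data.frame(
  ProdNameReduced = c("zibo sandwich x 1s", "zibo burger fold over x 1"),
  Code = c("ZIBOS/WICHSINGL", "ZIBOBURGER")
)

path <- write_named_csv(subzibo2)
print(basename(path))
```

The original attempts failed for a related reason: `prodsplit` is the whole data frame, so `paste()` turns it into one string per column (a vector of length greater than one, which `file()` rejects), and `sprintf()` has no format for a list. In the full function, `dir` would presumably be `"subs"`, and that directory has to exist before `write.csv` is called.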
