New to R ... struggling to produce results on 10,000 rows; the full data set has about 1M rows. Is there a better option than a loop? I have read about vectorization and attempted tapply with no success.
The data set has a column of free-form text and a category associated with each text. I need to split the text into individual words and then compute, per word, how often it occurs and how reliably its presence predicts the category. I read in the data via read.table and create a data.frame called data.
Sample data, and the function that parses the text and counts occurrences of each word per category:
data <- data.frame(category = c("cat1", "cat2", "cat3", "cat4"),
                   text = c("The quick brown fox",
                            "Jumps over the fence",
                            "The quick car hit a fence",
                            "Jumps brown"))
library(plyr)

parsefunc <- function(data) {
  finalframe <- data.frame()
  for (i in 1:nrow(data)) {
    # one row per word, carrying that row's category alongside it
    description <- strsplit(as.character(data[i, 2]), " ")[[1]]
    category <- rep(data[i, 1], length(description))
    worddataframe <- data.frame(description, category)
    finalframe <- rbind(finalframe, worddataframe)
  }
  # count each word/category pair (ddply names the count column V1 by default)
  m1 <- ddply(finalframe, c("description", "category"), nrow)
  names(m1)[3] <- "count"
  # per-word totals and the share of each category within that word
  m2 <- ddply(m1, "description", transform,
              totalcount = sum(count),
              percenttotal = count / sum(count))
  # keep words seen more than 10 times and dominated by one category
  m3 <- m2[(m2$totalcount > 10) & (m2$percenttotal > 0.8), ]
  m3
}
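
Would something along these lines be the right way to vectorise the loop? This is only a sketch based on what I have read (parsefunc_vec is just a placeholder name, and it assumes the category/text columns from the sample above): strsplit already accepts the whole text column at once, and rep() can stretch the categories to match, so the only remaining work is the same plyr summarising.

library(plyr)

parsefunc_vec <- function(data) {
  # strsplit is vectorised over the whole text column; returns a list of word vectors
  words <- strsplit(as.character(data$text), " ", fixed = TRUE)
  # repeat each category once per word so the two vectors line up row for row
  finalframe <- data.frame(description = unlist(words),
                           category    = rep(data$category, sapply(words, length)))
  # same summarising as above
  m1 <- ddply(finalframe, c("description", "category"), nrow)
  names(m1)[3] <- "count"
  m2 <- ddply(m1, "description", transform,
              totalcount = sum(count),
              percenttotal = count / sum(count))
  m2[(m2$totalcount > 10) & (m2$percenttotal > 0.8), ]
}

Would this be noticeably faster than the rbind loop on ~1M rows, or is there a better idiom for this kind of word/category counting?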