
New to R ... struggling to produce results on 10,000 lines; the data model actually has about 1M lines. Is there a better option than a loop? I've read about vectorization and attempted tapply with no success.

The data set has a column of free-form text and a category associated with the text. I need to parse the text into distinct words so I can compute statistics on how well word frequencies predict the category. I read in the data via read.table and create a data.frame called data.
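For reference, a minimal sketch of that load step, assuming a tab-separated file; the file name and options here are guesses, not the actual call:

# Hypothetical load step: file name, separator, and header are assumptions
data <- read.table("categorized_text.txt", header = TRUE, sep = "\t",
                   quote = "", stringsAsFactors = FALSE)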

The function below attempts to parse the text and count occurrences of each word:

data <- data.frame(category = c("cat1","cat2","cat3", "cat4"), 
                   text = c("The quick brown fox", 
                            "Jumps over the fence", 
                            "The quick car hit a fence",
                            "Jumps brown"))

library(plyr)

parsefunc <- function(data){
    finalframe <- data.frame()
    for (i in 1:nrow(data)){
        # split this row's text into individual words
        description <- strsplit(as.character(data[i, 2]), " ")[[1]]
        # repeat the row's category once per word
        category <- rep(data[i, 1], length(description))
        worddataframe <- data.frame(description, category)
        finalframe <- rbind(finalframe, worddataframe)
    }
    # count occurrences of each word within each category;
    # ddply names the count column "V1"
    m1 <- ddply(finalframe, c("description", "category"), nrow)
    # total count per word, and each category's share of that word
    m2 <- ddply(m1, 'description', transform,
                totalcount = sum(V1), percenttotal = V1/sum(V1))
    # keep words seen more than 10 times where one category holds over 80%
    m3 <- m2[(m2$totalcount > 10) & (m2$percenttotal > 0.8), ]
    m3
}
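Note that on the four-row sample above, m3 comes back empty, since no word clears the totalcount > 10 filter; the thresholds only bite on the full data:

result <- parsefunc(data)
nrow(result)  # 0 on the toy data: every totalcount is well under 10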
  • Conventional wisdom is that for loops are to be avoided in R. Opt for one of the `*apply` functions instead. http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega – erasmortg Jul 28 '15 at 18:29
  • 3
  • The conventional wisdom is simply wrong. Loops are just as fast as `*apply` functions. What's different is that you can more easily understand the `*apply` code. The use of data frames as containers for irregularly sized collections of words is going to be problematic as well as slow. I would imagine that asking for `data[i,7]` when there are only two columns in data will throw an error. Furthermore, supplying `cat` as a second argument to data.frame should also throw an error, since `cat` is a function. Completely unclear is what "V1" might be. Why not use pkg:tm? (See the sketch after these comments.) – IRTFM Jul 28 '15 at 18:40
  • OP, your `data` object doesn't have 7 columns and there isn't any `cat` (I presume that you mean `category`). Please edit the question to make it reproducible. @BondedDust I wouldn't say that loops are as fast as `*apply`. `vapply` in particular can be significantly faster than a loop in many instances. – nicola Jul 28 '15 at 18:53
  • But I think vapply won't allow you to return objects of varying length. – IRTFM Jul 28 '15 at 19:41
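For context, here is a minimal sketch of the pkg:tm route IRTFM mentions, assuming the sample data frame above; the pipeline and object names are illustrative, not code from the thread:

library(tm)

# build a corpus from the text column, lower-case it, and tabulate
# word counts per document with a document-term matrix
corp <- VCorpus(VectorSource(data$text))
corp <- tm_map(corp, content_transformer(tolower))
dtm <- DocumentTermMatrix(corp)
inspect(dtm)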

1 Answer


This will get your finalframe and do something close to your m1, m2, and m3 steps. You'll have to edit it to do exactly what you want. I used a longer data set of 40,000 rows to make sure it performs all right:

# long data set
data <- data.frame(Category = rep(paste0('cat', 1:4), 10000),
                   Text = rep(c('The quick brown fox', 'Jumps over the fence',
                                'The quick car hit a fence', 'Jumps brown cars'), 10000),
                   stringsAsFactors = FALSE)

# split into words
wordbag <- strsplit(data$Text, split = ' ')

# find the appropriate category for each word
categoryvar <- rep(data$Category, lengths(wordbag))

# stick them in a data frame and aggregate
newdf <- data.frame(category = categoryvar, word = tolower(unlist(wordbag)))
agg <- aggregate(list(wordcount = rep(1, nrow(newdf))),
                 list(category = newdf$category, word = newdf$word), sum)

# find the total count for each word in the entire data set and merge it in
wordagg <- aggregate(list(totalwordcount = rep(1, nrow(newdf))),
                     list(word = newdf$word), sum)
agg <- merge(x = agg, y = wordagg, by = 'word')

# find percentages and do whatever else you need
agg$percentageofword <- agg$wordcount/agg$totalwordcount
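To mirror the m3 filter from the question, with its thresholds copied over, something like this should work on the aggregated frame:

# words seen more than 10 times where one category holds over 80%
m3 <- agg[agg$totalwordcount > 10 & agg$percentageofword > 0.8, ]
m3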
  • This is brilliant ARobertson ... worked like a charm! Any advice for generating stats on combinations of words within the 'Text' field (e.g. 'fox' may not be indicative, yet the combination 'brown fox' is very informative)? Thanks again - saved me hours/days! – Alison Jul 28 '15 at 20:51
  • @Alison I only did a little of that for a project a long time ago, and I found n-grams a pain. But here is some help I found interesting [link](http://stackoverflow.com/questions/8161167/what-algorithm-i-need-to-find-n-grams) – ARobertson Jul 28 '15 at 23:49
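Following up on that comment thread, a rough base-R sketch of counting adjacent word pairs (bigrams); wordbag is the split list from the answer above, and the pairing scheme is an assumption, not code from the thread:

# Hypothetical bigram step: pair each word with its neighbour in the same text.
# head(w, -1) drops the last word and tail(w, -1) drops the first, so paste()
# lines them up as adjacent pairs; one-word texts yield no pairs.
bigrams <- lapply(wordbag, function(w) {
    w <- tolower(w)
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))
})
bigramcat <- rep(data$Category, lengths(bigrams))
bigramdf <- data.frame(category = bigramcat, bigram = unlist(bigrams))
# same aggregate pattern as before, now on word pairs
bigramagg <- aggregate(list(paircount = rep(1, nrow(bigramdf))),
                       list(category = bigramdf$category, bigram = bigramdf$bigram),
                       sum)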