
I'm aiming to make a bump chart of word frequency over time. I have about 36,000 individual entries, each consisting of a user's comment and an associated date. A 25-user sample is available here: http://pastebin.com/kKfby5kf

I'm trying to get the most frequent words (maybe top 10?) on a given date. I feel like my methodology is close, but not quite right:

    library("tm")

    frequencylist <- list()

    for(i in unique(sampledf[, 2])){
      # subset the comments posted on date i
      datesub <- subset(sampledf, sampledf[, 2] == i)

      # build and clean a corpus from that date's comments
      comments <- as.vector(datesub[, 1])
      verbatims <- Corpus(VectorSource(comments))
      verbatims <- tm_map(verbatims, stripWhitespace)
      verbatims <- tm_map(verbatims, content_transformer(tolower))
      verbatims <- tm_map(verbatims, removeWords, stopwords("english"))
      verbatims <- tm_map(verbatims, removePunctuation)

      # second pass to drop "game", which is very common but not interesting
      stopwords2 <- c("game")
      verbatims2 <- tm_map(verbatims, removeWords, stopwords2)

      # tabulate word frequencies for this date
      dtm <- DocumentTermMatrix(verbatims2)
      dtm2 <- as.matrix(dtm)
      frequency <- colSums(dtm2)
      frequency <- sort(frequency, decreasing = TRUE)
      frequencydf <- data.frame(frequency)
      frequencydf$comments <- row.names(frequencydf)
      frequencydf$date <- i

      frequencylist[[i]] <- frequencydf
    }

An explanation of my madness: the pastebin example goes into sampledf. For each unique date in the sample, I'm trying to get a word frequency. I'm then attempting to store that tabulated word frequency in a list (might not be the best approach, though). First, I subset by date, then strip whitespace, common English words, punctuation, and lowercase it all. I then do another pass of word removal for "game" since it's not too interesting but very common. To get the word frequency, I then pass it into a document term matrix and do a simple colSums(). Then I append the date for that table and try to store it in a list.
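For reference, here is a minimal sketch of how the same per-date tabulation could be collapsed with `lapply` and bound into a single data frame (same `tm` pipeline; it assumes, as above, that column 1 of `sampledf` holds the comment text and column 2 the date):

    library("tm")

    # one word-frequency table per date, same cleaning steps as the loop above
    freq_by_date <- lapply(unique(sampledf[, 2]), function(d) {
      verbatims <- Corpus(VectorSource(as.character(sampledf[sampledf[, 2] == d, 1])))
      verbatims <- tm_map(verbatims, content_transformer(tolower))
      verbatims <- tm_map(verbatims, removePunctuation)
      verbatims <- tm_map(verbatims, removeWords, c(stopwords("english"), "game"))
      verbatims <- tm_map(verbatims, stripWhitespace)

      # tabulate and sort word frequencies for this date
      freq <- sort(colSums(as.matrix(DocumentTermMatrix(verbatims))), decreasing = TRUE)
      data.frame(word = names(freq), frequency = freq, date = d, row.names = NULL)
    })
    frequencydf <- do.call(rbind, freq_by_date)

The result is one long data frame with `word`, `frequency`, and `date` columns, which seems easier to rank and plot than a list of per-date tables.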

I'm not sure if my strategy is valid to begin with. Is there a simpler, better approach to this problem?

AI52487963
  • Can you `dput()` the data? It seems you have odd characters in there. It also sounds like there's a chance `frequency` has length 0 for at least one observation. You should check that. Also, when adding items to a list it should be `frequencylist[[i]] <- frequencydf`, or, even better, this can be replaced with an `lapply`. – MrFlick Oct 22 '15 at 02:28
  • Whoops I noticed a bug in the subset function, just fixed it. Which data do you mean? `dput()` for the subsetted data or the resultant list of dataframes? – AI52487963 Oct 22 '15 at 03:51
  • A `dput()` is better than a link to an external paste site. See [how to make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Now that you don't have an error, this really isn't a specific question. If your code works but you want someone to review it for style, that's a better fit for [codereview.se], not Stack Overflow. Matters of style do not have a single clear answer to accept. – MrFlick Oct 22 '15 at 03:56

1 Answer


The commenters are correct that there are better ways to set up a reproducible example. In addition, your question could be more specific about what you are trying to accomplish as an output. (I could not get your code to execute without error.)

However: You asked for a simpler, better approach. Here is what I think is both. It uses the quanteda text package and exploits the groups feature when creating the document-feature matrix. Then it performs some rankings on the "dfm" to get what you need in terms of daily term rankings.

Note that this is based on my having loaded your linked data using `read.delim("sampledf.tsv", stringsAsFactors = FALSE)`.

require(quanteda)
# create a corpus with a date document variable
myCorpus <- corpus(sampledf$content_strip, 
                   docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%m/%d/%Y")))

# construct a dfm, group on date, and remove stopwords plus the term "game"
myDfm <- dfm(myCorpus, groups = "date", ignoredFeatures = c("game", stopwords("english")))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: date
## ... lowercasing
## ... tokenizing
## ... indexing documents: 20 documents
## ... indexing features: 198 feature types
## ... removed 47 features, from 175 supplied (glob) feature types
## ... created a 20 x 151 sparse dfm
## ... complete. 
## Elapsed time: 0.009 seconds.

myDfm <- sort(myDfm) # not required, just for presentation
# remove a really nasty long term
myDfm <- removeFeatures(myDfm, "^a{10}", valuetype = "regex")
## removed 1 feature, from 1 supplied (regex) feature types

# make a data.frame of the daily ranks of each feature
featureRanksByDate <- as.data.frame(t(apply(myDfm, 1, order, decreasing = TRUE)))
names(featureRanksByDate) <- features(myDfm)
featureRanksByDate[, 1:10]
##              â great nice play  go will can get ever first
## 2013-10-02   1    18   19   20  21   22  23  24   25    26
## 2013-10-04   3     1    2    4   5    6   7   8    9    10
## 2013-10-05   3     9   28   29   1    2   4   5    6     7
## 2013-10-06   7     4    8   10  11   30  31  32   33    34
## 2013-10-07   5     1    2    3   4    6   7   8    9    10
## 2013-10-09  12    42   43    1   2    3   4   5    6     7
## 2013-10-13   1    14    6    9  10   13  44  45   46    47
## 2013-10-16   2     3   84   85   1    4   5   6    7     8
## 2013-10-18  15     1    2    3   4    5   6   7    8     9
## 2013-10-19   3    86    1    2   4    5   6   7    8     9
## 2013-10-22   2    87   88   89  90   91  92  93   94    95
## 2013-10-23  13    98   99  100 101  102 103 104  105   106
## 2013-10-25   4     6    5   12  16  109 110 111  112   113
## 2013-10-27   8     4    6   15  17  124 125 126  127   128
## 2013-10-30  11     1    2    3   4    5   6   7    8     9
## 2014-10-01   7    16  139    1   2    3   4   5    6     8
## 2014-10-02 140     1    2    3   4    5   6   7    8     9
## 2014-10-03 141   142  143    1   2    3   4   5    6     7
## 2014-10-05 144   145  146  147 148    1   2   3    4     5
## 2014-10-06  17   149  150    1   2    3   4   5    6     7

# top n features by day
n <- 10 
as.data.frame(apply(featureRanksByDate, 1, function(x) {
    todaysTopFeatures <- names(featureRanksByDate)
    names(todaysTopFeatures) <- x
    todaysTopFeatures[as.character(1:n)]
}), row.names = 1:n)
##    2013-10-02 2013-10-04 2013-10-05 2013-10-06 2013-10-07 2013-10-09 2013-10-13 2013-10-16 2013-10-18 2013-10-19 2013-10-22 2013-10-23
## 1           â      great         go     triple      great       play          â         go      great       nice       year       year
## 2         win       nice       will      niple       nice         go    created          â       nice       play          â       give
## 3        year          â          â   backflip       play       will      wasnt      great       play          â       give       good
## 4        give       play        can      great         go        can      money       will         go         go       good       hard
## 5        good         go        get      scope          â        get     prizes        can       will       will       hard       time
## 6        hard       will       ever       ball       will       ever       nice        get        can        can       time     triple
## 7        time        can      first          â        can      first      piece       ever        get        get     triple      niple
## 8      triple        get        fun       nice        get        fun       dead      first       ever       ever      niple   backflip
## 9       niple       ever      great   testical       ever        win       play        fun      first      first   backflip      scope
## 10   backflip      first        win       play      first       year         go        win        fun        fun      scope       ball
##    2013-10-25 2013-10-27 2013-10-30 2014-10-01 2014-10-02 2014-10-03 2014-10-05 2014-10-06
## 1       scope      scope      great       play      great       play       will       play
## 2        ball       ball       nice         go       nice         go        can         go
## 3    testical   testical       play       will       play       will        get       will
## 4           â      great         go        can         go        can       ever        can
## 5        nice       shot       will        get       will        get      first        get
## 6       great       nice        can       ever        can       ever        fun       ever
## 7        shot       head        get          â        get      first        win      first
## 8        head          â       ever      first       ever        fun       year        fun
## 9     dancing    dancing      first        fun      first        win       give        win
## 10        cow        cow        fun        win        fun       year       good       year

BTW interesting spellings of niple and testical.
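
To get from these rankings to the bump chart mentioned in the question, here is a minimal ggplot2 sketch. It assumes the daily rankings have first been reshaped into a long data frame, called `topWords` here purely for illustration, with one row per date/word and columns `date` (a Date), `word`, and `rank` (1 = most frequent that day):

    library(ggplot2)

    # bump chart: one line per word, daily rank on the y-axis, rank 1 at the top
    ggplot(topWords, aes(x = date, y = rank, group = word, colour = word)) +
      geom_line() +
      geom_point(size = 2) +
      scale_y_reverse(breaks = 1:10) +
      labs(x = NULL, y = "Daily rank", colour = "Word") +
      theme_minimal()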

Ken Benoit