I want to do some text mining and, for several reasons, I have built a data frame with words in one column and their frequencies in the second. For example:

 words freq
 Have   123
 have    5
 having 4589

Note that when the frequencies are very large, transforming the words in this form may be more efficient than working with a corpus in which certain words are repeated many, many times.

I would like to use tm to transform the words using tolower, stemDocument, etc.

I know I can pull the words column out of the data frame into a corpus, but then I will lose the frequency information.

I would like to get:

 words freq
 have   123
 have    5
 have  4589

Then I think I can use setDT, the dplyr package or aggregate to get to:

words freq
have  4717

I plan to do this on a large data frame. Thanks

I did try to mimic the approach in "tm: read in data frame, keep text id's, construct DTM and join to other dataset".

Oli

1 Answer

There is no need for a text analysis package here: you can do it using tolower() and wordStem() from the SnowballC package. Using data.table also makes it very fast.

library(data.table)
dt <- data.table(words = c("Have", "have", "having"),
                 freq = c(123, 5, 4589))

# transform to lowercase
dt[, words := tolower(words)]

# stem the words
dt[, words := SnowballC::wordStem(words)]

dt
##    words freq
## 1:  have  123
## 2:  have    5
## 3:  have 4589

# aggregate on same lowercased stems
dt[, list(freq = sum(freq)), by = words]
##    words freq
## 1:  have 4717
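
If the data starts out as a plain data.frame rather than a data.table, the `by =` argument inside `[` is not available, so the object needs converting first. A minimal sketch of the same steps, assuming a data.frame named `df` (a hypothetical name) with the same two columns:

```r
library(data.table)

# hypothetical input: a plain data.frame with the same columns as above
df <- data.frame(words = c("Have", "have", "having"),
                 freq = c(123, 5, 4589),
                 stringsAsFactors = FALSE)

# convert to a data.table by reference (no copy is made)
setDT(df)

# lowercase and stem in one step, then sum frequencies per stem
df[, words := SnowballC::wordStem(tolower(words))]
res <- df[, list(freq = sum(freq)), by = words]
res
```

setDT() changes the class of `df` in place, which matters on a large data frame because it avoids copying the whole object.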

My version of data.table:

packageVersion("data.table")
## [1] ‘1.9.6’
Ken Benoit
  • Haven't aggregated yet. It kind of works; however, experience became experi, business became busi, finance became financ, and financial became financi. Basically, a load of words have been cut short. – Oli May 20 '16 at 15:39
  • Then the aggregation: df2[, list(df2$AC=sum(df2$AC)), by=df2$Row.Labels] gives Error: unexpected '=' in "df2[, list(df2$AC=", and df2[, list(AC=sum(AC)), by=Row.Labels] gives Error in `[.data.frame`(df2, , list(AC = sum(AC)), by = Row.Labels) : unused argument (by = Row.Labels) – Oli May 20 '16 at 15:42
  • Your example doesn't work for me: Error in `[<-.data.table`(x, j = name, value = value) : RHS of assignment to existing column 'words' is zero length but not NULL. If you intend to delete the column use NULL. Otherwise, the RHS must have length > 0; e.g., NA_integer_. If you are trying to change the column type to be an empty list column then, as with all column type changes, provide a full length RHS vector such as vector('list',nrow(DT)); i.e., 'plonk' in the new column. – Oli May 20 '16 at 15:50
  • Using data.table, dt[, list(AC=sum(AC)), by=Row.Labels] returns: Row.Labels AC; 1: will NA; 2: experi NA; 3: role 1710; 4: account NA; 5: busi NA – Oli May 20 '16 at 15:55
  • Same package version, and str(dt) gives: Classes ‘data.table’ and 'data.frame': 37075 obs. of 40 variables: $ Row.Labels: chr "will" "experi" "role" "account" ... $ AC : int 2431 1800 1664 1518 1428 1293 1238 1223 1216 1206 ... – Oli May 20 '16 at 16:05
  • Sorry, my bad, I have now amended my answer for fast transformations of the text by reference, the data.table way. – Ken Benoit May 20 '16 at 16:11
  • Okay, it does turn both have and having into have. But management was chopped into manag, and finance and financial were chopped into financ and financi. I'm guessing this is an issue with the package? – Oli May 23 '16 at 07:54
  • That's just how the Porter stemmer implementation in **SnowballC** works. – Ken Benoit May 23 '16 at 08:37
  • Can you expand this to remove stopwords too? I've played around but can't figure it out. I would want to remove the whole row rather than just make the entry in that column null. – Oli May 23 '16 at 12:44
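
For the stopword question in the last comment, one possible approach is to drop the matching rows with `%in%` before stemming, so that the stopword list is compared against whole lowercased words rather than stems. A minimal sketch, assuming a small hand-written stopword vector (`stop_words` and the sample rows are hypothetical; `tm::stopwords("english")` would give a fuller list if tm is installed):

```r
library(data.table)

# hypothetical sample data, including two stopwords
dt <- data.table(words = c("The", "have", "having", "of", "role"),
                 freq = c(900, 5, 4589, 700, 1710))

# hypothetical stopword list, for illustration only
stop_words <- c("the", "of", "and", "a")

# lowercase first so that "The" matches "the"
dt[, words := tolower(words)]

# keep only rows whose word is not a stopword (removes the whole row)
dt <- dt[!words %in% stop_words]

# stem, then sum frequencies per stem
dt[, words := SnowballC::wordStem(words)]
res <- dt[, list(freq = sum(freq)), by = words]
res
```

Filtering with `dt[!words %in% stop_words]` removes the entire row, so the frequency of a stopword never reaches the aggregation step.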