Create a term-frequency matrix from a double column occurrences matrix

Question

I have an input tab table of occurrences of words and documents:

#       WORDS          DOCUMENTS
1  chr1-1-5872           A_1
2  chr1-5873-14436       A_2
3 chr1-14437-17846       A_3
4 chr1-17847-20294       A_2
5 chr1-20295-22639       A_5

I want to get a frequency matrix with all the words as rows and all the document names as columns, where each entry is the number of times that word is found associated with that document:

#                       A_1   A_2  A_3   A_4    A_5
1  chr1-1-5872           1     1    0     0      0
2  chr1-5873-14436       0     0    0     0      0 
3 chr1-14437-17846       0     0    1     0      0 
4 chr1-17847-20294       0     1    0     0      0 
5 chr1-20295-22639       0     0    0     0      0

I tried with the following command:

 result <- t(with(tab, wfm(tab$WODS, tab$DOCUMENTS)))

But all I got was

             A_1 A_2 A_3 A_5
grouping.var   1   2   1   1

What am I doing wrong? How can I get my matrix with row names as requested?

I feel like you forgot to include some vital information – Rich Scriven, Oct 16 '14 at 23:27

score 4 · Answer 1 · edited Oct 03 '16 at 06:53

Using the table function (df here is the two-column data frame from the question, called tab there):

table(df)
#                    DOCUMENTS
# WORDS             A_1 A_2 A_3 A_5
#  chr1-1-5872        1   0   0   0
#  chr1-14437-17846   0   0   1   0
#  chr1-17847-20294   0   1   0   0
#  chr1-20295-22639   0   0   0   1
#  chr1-5873-14436    0   1   0   0
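
The desired output in the question includes an A_4 column and keeps the words in their input order, while the table output above drops A_4 (it never occurs in the data) and sorts the words. A minimal sketch of one way to get both, assuming the full set of documents is A_1 through A_5 (an assumption read off the question's desired output; doc_levels and freq are illustrative names):

```r
# Rebuild the question's input as a data frame (values copied from the question).
df <- data.frame(
  WORDS = c("chr1-1-5872", "chr1-5873-14436", "chr1-14437-17846",
            "chr1-17847-20294", "chr1-20295-22639"),
  DOCUMENTS = c("A_1", "A_2", "A_3", "A_2", "A_5"),
  stringsAsFactors = FALSE
)

# Assumption: the complete document set is A_1..A_5, as in the desired output.
doc_levels <- paste0("A_", 1:5)

# Explicit factor levels make table() keep empty categories (A_4) and
# keep the words in input order instead of sorting them.
df$WORDS <- factor(df$WORDS, levels = unique(df$WORDS))
df$DOCUMENTS <- factor(df$DOCUMENTS, levels = doc_levels)

freq <- table(df)
print(freq)
```

table() counts every combination of factor levels, including levels with no observations, which is why the all-zero A_4 column appears here.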

We can also wrap it in as.data.frame.matrix to get a data.frame instead of a table object:

as.data.frame.matrix(table(df))
#                  A_1 A_2 A_3 A_5
# chr1-1-5872        1   0   0   0
# chr1-14437-17846   0   0   1   0
# chr1-17847-20294   0   1   0   0
# chr1-20295-22639   0   0   0   1
# chr1-5873-14436    0   1   0   0
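
The reason for as.data.frame.matrix rather than plain as.data.frame is that the latter dispatches to the table method and returns the counts in long format, one row per word/document pair. A small sketch of the difference, rebuilding the question's data so it runs on its own (counts, long and wide are illustrative names):

```r
# Question's input, copied from the post.
df <- data.frame(
  WORDS = c("chr1-1-5872", "chr1-5873-14436", "chr1-14437-17846",
            "chr1-17847-20294", "chr1-20295-22639"),
  DOCUMENTS = c("A_1", "A_2", "A_3", "A_2", "A_5")
)

counts <- table(df)

# Long format: columns WORDS, DOCUMENTS, Freq; one row per combination.
long <- as.data.frame(counts)

# Wide format: words as row names, one column per document.
wide <- as.data.frame.matrix(counts)

print(head(long))
print(wide)
```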

Or using the dcast function from reshape2 (just for general knowledge):

library(reshape2)
dcast(df, WORDS ~ DOCUMENTS, length)
#              WORDS A_1 A_2 A_3 A_5
# 1      chr1-1-5872   1   0   0   0
# 2 chr1-14437-17846   0   0   1   0
# 3 chr1-17847-20294   0   1   0   0
# 4 chr1-20295-22639   0   0   0   1
# 5  chr1-5873-14436   0   1   0   0

Tyler Rinker · Answer 2 · 2014-10-16T23:39:30.813

I believe you're using the qdap package. If your text really looks like that (i.e., each row is actually a single word), then wfm is overkill, and you'd need to change a bunch of arguments to avoid stripping of data. What you really want is to reshape the data. Here's an approach:

library(qdap)
as.wfm(with(tab, mtabulate(setNames(DOCUMENTS, WORDS))))

##                  A_1 A_2 A_3 A_5
## chr1-1-5872        1   0   0   0
## chr1-5873-14436    0   1   0   0
## chr1-14437-17846   0   0   1   0
## chr1-17847-20294   0   1   0   0
## chr1-20295-22639   0   0   0   1
