I have an input table, tab, of word and document occurrences:

#       WORDS          DOCUMENTS
1  chr1-1-5872           A_1
2  chr1-5873-14436       A_2
3 chr1-14437-17846       A_3
4 chr1-17847-20294       A_2
5 chr1-20295-22639       A_5

I want to get a frequency matrix with all the words as rows and all the document names as columns, where each entry is the number of times that word is found associated with that document:

#                       A_1   A_2  A_3   A_4    A_5
1  chr1-1-5872           1     0    0     0      0
2  chr1-5873-14436       0     1    0     0      0
3 chr1-14437-17846       0     0    1     0      0
4 chr1-17847-20294       0     1    0     0      0
5 chr1-20295-22639       0     0    0     0      1

I tried with the following command:

 result <- t(with(tab, wfm(tab$WODS, tab$DOCUMENTS)))

But all I got was

             A_1 A_2 A_3 A_5
grouping.var   1   2   1   1

What am I doing wrong? How can I get my matrix with the words as row names, as requested?

Brian Tompsett - 汤莱恩
DavideChicco.it

2 Answers

Using the table function (df here is the input table, called tab in the question):

table(df)
#                    DOCUMENTS
# WORDS             A_1 A_2 A_3 A_5
#  chr1-1-5872        1   0   0   0
#  chr1-14437-17846   0   0   1   0
#  chr1-17847-20294   0   1   0   0
#  chr1-20295-22639   0   0   0   1
#  chr1-5873-14436    0   1   0   0

We can also wrap the result in as.data.frame.matrix to get a data.frame:

as.data.frame.matrix(table(df))
#                  A_1 A_2 A_3 A_5
# chr1-1-5872        1   0   0   0
# chr1-14437-17846   0   0   1   0
# chr1-17847-20294   0   1   0   0
# chr1-20295-22639   0   0   0   1
# chr1-5873-14436    0   1   0   0
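
As a small sketch of why as.data.frame.matrix is used here rather than plain as.data.frame: on a table object, as.data.frame returns the long format (one row per word/document pair with a Freq column), while as.data.frame.matrix keeps the wide layout. The data frame below is rebuilt by hand from the question's input, and the names counts, long and wide are only illustrative.

```r
# Input rebuilt from the question's table (named df, as in this answer).
df <- data.frame(
  WORDS = c("chr1-1-5872", "chr1-5873-14436", "chr1-14437-17846",
            "chr1-17847-20294", "chr1-20295-22639"),
  DOCUMENTS = c("A_1", "A_2", "A_3", "A_2", "A_5"),
  stringsAsFactors = FALSE
)

counts <- table(df)

# Long format: one row per WORDS/DOCUMENTS combination, counts in Freq.
long <- as.data.frame(counts)

# Wide format: words as row names, one column per document.
wide <- as.data.frame.matrix(counts)

head(long)
wide
```

The wide form is the one that matches the matrix asked for in the question.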

Or use the dcast function from reshape2 (just for general knowledge):

library(reshape2)
dcast(df, WORDS ~ DOCUMENTS, length)
#              WORDS A_1 A_2 A_3 A_5
# 1      chr1-1-5872   1   0   0   0
# 2 chr1-14437-17846   0   0   1   0
# 3 chr1-17847-20294   0   1   0   0
# 4 chr1-20295-22639   0   0   0   1
# 5  chr1-5873-14436   0   1   0   0
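
Note that the outputs above sort the words alphabetically and have no A_4 column, because A_4 never occurs in the data. If the empty column and the original row order from the question are wanted, one possible approach is to pass factors with explicit levels to table. This is only a sketch: it assumes the full set of documents is A_1 to A_5, and doc_levels and freq are illustrative names.

```r
# Input rebuilt from the question's table.
tab <- data.frame(
  WORDS = c("chr1-1-5872", "chr1-5873-14436", "chr1-14437-17846",
            "chr1-17847-20294", "chr1-20295-22639"),
  DOCUMENTS = c("A_1", "A_2", "A_3", "A_2", "A_5"),
  stringsAsFactors = FALSE
)

# Assumption: the complete list of documents is A_1 ... A_5,
# even though A_4 has no occurrences in this input.
doc_levels <- paste0("A_", 1:5)

# Explicit factor levels keep empty columns and fix the row order
# to the order in which the words first appear.
freq <- table(
  WORDS = factor(tab$WORDS, levels = unique(tab$WORDS)),
  DOCUMENTS = factor(tab$DOCUMENTS, levels = doc_levels)
)

freq
```

The same result can be turned into a data.frame with as.data.frame.matrix(freq).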
zx8754
David Arenburg

I believe you're using the qdap package. If your text really looks like that (i.e., each row is actually a single word), then wfm is overkill, and you'd need to change a bunch of arguments to keep it from stripping parts of the data. What you really want is to reshape the data. Here's an approach:

library(qdap)
as.wfm(with(tab, mtabulate(setNames(DOCUMENTS, WORDS))))

##                  A_1 A_2 A_3 A_5
## chr1-1-5872        1   0   0   0
## chr1-5873-14436    0   1   0   0
## chr1-14437-17846   0   0   1   0
## chr1-17847-20294   0   1   0   0
## chr1-20295-22639   0   0   0   1
Tyler Rinker