I have a sparse term-document matrix built with the tm package in R.

I can convert to a term-term matrix using this snippet of code:

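library("tm")
data(crude)
couple.of.words <- c("embargo", "energy", "oil", "environment", "estimate")
# Term-document matrix restricted to the terms of interest
tdm <- TermDocumentMatrix(crude, control = list(dictionary = couple.of.words))
tdm.matrix <- as.matrix(tdm)
# Binarize: 1 if the term appears in the document, 0 otherwise
tdm.matrix[tdm.matrix >= 1] <- 1
# Term-term matrix: entry [i, j] counts the documents containing both terms
tdm.matrix <- tdm.matrix %*% t(tdm.matrix)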

However, this is not what I really need, since I have to build a data frame suitable for loading into a network analysis tool like Gephi. This data frame should ideally have three columns:

{term1, term2, number of docs where term1 and term2 co-occur}

For example (not from the real data in the code above), if the words "embargo" and "energy" co-occur in three documents (this can be seen in the tdm matrix, where each document is a column), I would have a row like this:

+-----------+-------------+------+
| term1     | term2       | Freq |
+-----------+-------------+------+
| embargo   | energy      |  3   |
+-----------+-------------+------+

How can I build this nodes/edges data frame from the term-document or the term-term matrix?

Gabriele B
  • Please supply a minimal [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so we can see the classes and structures of the objects involved. If you give sample data, also give the desired output so we can test various strategies. – MrFlick Sep 11 '14 at 13:03
  • Added some example code and put some emphasis on the desired output. – Gabriele B Sep 11 '14 at 13:52

1 Answer

It sounds like you can get what you need by adding one more line of code:

desired <- as.data.frame(as.table(tdm.matrix))
head(desired)

#         Terms Terms.1 Freq
# 1     embargo embargo    8
# 2      energy embargo    6
# 3 environment embargo    2
# 4    estimate embargo    4
# 5         oil embargo   44
# 6     embargo  energy    6

The as.table() call really only changes the class. It just so happens that there is an existing as.data.frame.table() method that flattens tables into frequency listings like the one you want.
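To see the flattening in isolation, here is a minimal base-R sketch that does not need tm. The toy matrix `tdm.toy`, its document names, and the column names `term1`/`term2` are made up for illustration and are not taken from the `crude` data.

```r
# Toy binary term-document matrix (hypothetical data, not from `crude`)
tdm.toy <- matrix(c(1, 1, 0,
                    1, 0, 1,
                    0, 1, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Terms = c("embargo", "energy", "oil"),
                                  Docs  = c("d1", "d2", "d3")))

# Term-term co-occurrence counts, as in the question
cooc <- tdm.toy %*% t(tdm.toy)

# as.table() changes the class; as.data.frame() then dispatches to
# as.data.frame.table() and returns one row per (row term, column term) cell
edges <- as.data.frame(as.table(cooc), stringsAsFactors = FALSE)

# Rename the columns to the layout asked for in the question (assumed names)
names(edges) <- c("term1", "term2", "Freq")
edges
```

The result has one row for every cell of the term-term matrix, so it still contains the diagonal (a term paired with itself) and both orderings of each pair.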

MrFlick
  • It works perfectly; I'm just wondering if there is an easy way to get rid of permutations, i.e. the second and the sixth rows in the above example: it's the same relation, just reversed. I think this would help, but I'm not sure: http://stackoverflow.com/questions/14078507/remove-duplicated-2-columns-permutations – Gabriele B Sep 12 '14 at 09:41
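One possible way to drop the reversed pairs, offered as a sketch rather than the answer's method: because the term-term matrix is symmetric, it may be enough to keep only the cells strictly above the diagonal before building the data frame. The matrix `cooc` below is a made-up stand-in for `tdm.matrix`, and the column names are assumptions.

```r
# Hypothetical symmetric term-term matrix standing in for tdm.matrix
terms <- c("embargo", "energy", "oil")
cooc <- matrix(c(4, 3, 1,
                 3, 5, 2,
                 1, 2, 6),
               nrow = 3, byrow = TRUE,
               dimnames = list(terms, terms))

# TRUE only strictly above the diagonal: one cell per unordered pair,
# and no self-pairs
keep <- upper.tri(cooc)

# Row/column positions of the kept cells, in column-major order
idx <- which(keep, arr.ind = TRUE)

edges <- data.frame(term1 = rownames(cooc)[idx[, "row"]],
                    term2 = colnames(cooc)[idx[, "col"]],
                    Freq  = cooc[keep],   # same column-major order as idx
                    stringsAsFactors = FALSE)
edges
```

If pairs that never co-occur should not become edges in Gephi, a further filter such as `edges[edges$Freq > 0, ]` could be applied afterwards.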