
I'm looking for help filling in the many gaps in my experience with the igraph package in R. I want to create a graph of the common words used by 5 Twitter accounts. My aim is to see which keywords the accounts share and to identify others that are used by one account but not the others.

I've created a wordcloud from the tweet text, but I'd appreciate any help from the community in converting that (if possible) to a graph. So far I have a TermDocumentMatrix, built with the tm package, that holds the word frequencies shown in the wordcloud, and I'd like to incorporate that frequency data into the final plot as well.

I'm not sure what format my data needs to be in or where I should start (dataframe, Corpus, Matrix). Any pointers?

These are the steps I have taken so far to clean and process the data:

Clean up the tweet text, stored in the "text" column of a dataframe called tweetsDF, to find the common words used. Start by removing the hashtags from the text using the qdapRegex package:

Text <- rm_hash(tweetsDF$text, clean=TRUE, trim=TRUE)
# Remove the Twitter shortened URLs using the qdapRegex package

TextNoShortURL <- rm_twitter_url(Text, trim = TRUE, clean = TRUE, extract = FALSE)

# Build a corpus from the cleaned text using the tm package

TextCorpus <- Corpus(VectorSource(TextNoShortURL))

# Create a Term Document Matrix, removing punctuation and common English words (including "and", "the", "for") using the tm package

TextTDM <- TermDocumentMatrix(TextCorpus, control = list(removePunctuation = TRUE, stopwords = c(stopwords("english"), "the", "for", "and"), removeNumbers = FALSE))

# Convert it to a Matrix

TextMatrix <- as.matrix(TextTDM)

# Get the frequency of the words found in the text

MainWord_freqs = sort(rowSums(TextMatrix), decreasing=TRUE) 

# Convert it to a dataframe

TextDF <- data.frame(word=names(MainWord_freqs), freq=MainWord_freqs)

# You end up with a dataframe that contains each word and how often it appeared in the text. The Twitter handle isn't included here, but I assume I can mutate the dataframe to include it.

<PRE>
     row.names   word    freq
1    shop        shop    8765
2    food        food     924
3    drink       drink   8273
..
</PRE>

I'm not sure where to go from here to get a data source suitable for igraph that will let me associate the Twitter handle XYZ with the main words used.

mobcdi
  • For the community to help, you need to provide some data and your code so far. – lawyeR Jul 06 '15 at 10:19
  • I'm not able to attach or link to the text, but assume I start out with a dataframe containing the tweets, with the text of the tweets stored in tweetsDF$text. I use the qdapRegex and tm packages to extract and clean the text – mobcdi Jul 06 '15 at 11:08
  • Generally, you should abstract from your problem a bit and provide reproducible example code, which people can copy, paste & run. This makes it a bit more difficult for you, but easier for all others. Here's how to do it: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – lukeA Jul 06 '15 at 11:14
  • Note that my code is reproducible: you can copy, paste & run it in R and reproduce the output more or less, whereas yours still isn't. For example, the example data is still missing. Please read the info provided by the link above. – lukeA Jul 06 '15 at 12:15

1 Answer


Maybe try some sort of a bipartite graph like this:

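library(igraph)
library(tm)
library(reshape2)
tweets <- c("This is a test", "This is another test", "blah")
# Term-document matrix: rows are terms, columns are tweets
mat <- as.matrix(TermDocumentMatrix(Corpus(VectorSource(tweets))))
# Melt to (term, document, count), keep non-zero counts, and drop the count column
g <- graph.data.frame(subset(melt(mat), !!value, -value), directed = FALSE)
# Color term vertices 2 and document vertices 3
V(g)$color <- rep(2:3, dim(mat))
plot(g)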

[image: plot of the resulting graph]


Addition, to show the word counts as edge widths:

library(igraph)
library(tm)
library(reshape2)
tweets <- c("This is a test test test test test", "This This is another test", "blah")
mat <- as.matrix(TermDocumentMatrix(Corpus(VectorSource(tweets))))
# Keep the counts as an edge attribute named "width", which plot() uses as the line width
g <- graph.data.frame(subset(melt(mat, value.name = "width"), !!width), directed = FALSE)
V(g)$color <- rep(2:3, dim(mat))
plot(g)

[image: plot of the graph with edge widths scaled by word count]
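The same idea may carry over to the original aim of comparing accounts rather than single tweets. The following is only a minimal sketch under stated assumptions: the `handle` column, the toy tweets, and the names `perAccount`, `edges`, `sharedWords` and `uniqueWords` are hypothetical and not taken from the question. It collapses the tweets of each account into one document, so that documents in the term-document matrix correspond to accounts, and then uses each word's degree to separate shared words from words used by one account only. It uses `graph_from_data_frame`, the newer name for `graph.data.frame`, and base R's `as.table` in place of `melt`.

```r
library(tm)
library(igraph)

# Hypothetical toy data: one row per tweet, with the account that sent it
tweetsDF <- data.frame(
  handle = c("acctA", "acctA", "acctB", "acctB", "acctC"),
  text   = c("shop food shop", "drink shop", "food drink",
             "food market", "shop market music"),
  stringsAsFactors = FALSE
)

# Collapse all tweets of an account into a single document
perAccount <- sapply(split(tweetsDF$text, tweetsDF$handle), paste, collapse = " ")

# Term-document matrix: rows are words, columns are accounts
mat <- as.matrix(TermDocumentMatrix(Corpus(VectorSource(unname(perAccount)))))
colnames(mat) <- names(perAccount)

# Long-format edge list (word, handle, width), keeping non-zero counts only
edges <- as.data.frame(as.table(mat), responseName = "width",
                       stringsAsFactors = FALSE)
names(edges) <- c("word", "handle", "width")
edges <- edges[edges$width > 0, ]

# Extra columns such as "width" become edge attributes.
# Assumes no word is spelled exactly like an account handle.
g <- graph_from_data_frame(edges, directed = FALSE)
V(g)$color <- ifelse(V(g)$name %in% colnames(mat), 3, 2)

# A word's degree is the number of accounts that use it
wordDegree  <- degree(g)[rownames(mat)]
sharedWords <- names(wordDegree)[wordDegree > 1]
uniqueWords <- names(wordDegree)[wordDegree == 1]

plot(g)
```

With real data the raw counts will probably be too large to use directly as line widths, so some rescaling of `E(g)$width` before plotting is likely to be needed.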

lukeA
  • Hi lukeA, your comment about melt helped me move my thinking on. If I have a dataframe with the structure df$word, df$Tweeter, df$freq, how would I create a graph that shows stronger relations between word and tweeter for higher values of frequency? – mobcdi Jul 06 '15 at 12:05
  • You can specify the edge widths in `E(g)$width` or map them in the plot command using `plot(..., edge.width = my_count_var, ...)`. See my edit above. (Note the newly added repetitions in `tweets`.) – lukeA Jul 06 '15 at 12:12
  • Oops, I thought I saved the edit, but it seems I didn't, so here it is again (2nd screenshot)... – lukeA Jul 06 '15 at 12:49
  • Would `Mygraph <- graph.data.frame(TextDF.melt)` followed by `Mygraph$weight <- TextDF.melt$value` work as well? – mobcdi Jul 06 '15 at 13:58
  • Try it out. `E(Mygraph)$weight <- ...` and `plot(Mygraph, edge.width = E(Mygraph)$weight)`. – lukeA Jul 06 '15 at 14:15
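The suggestion in the last comment could be tried along these lines. This is a hedged sketch, not the poster's actual code: the dataframe `df` follows the word / Tweeter / freq structure described in the comments, the counts for XYZ are copied from the question's table, and the "ABC" row and the `edgeWidth` rescaling are assumptions added for illustration. It uses `graph_from_data_frame`, the newer name for `graph.data.frame`.

```r
library(igraph)

# Hypothetical data in the shape described in the comments: word, Tweeter, freq
df <- data.frame(
  word    = c("shop", "food", "drink", "shop"),
  Tweeter = c("XYZ", "XYZ", "XYZ", "ABC"),
  freq    = c(8765, 924, 8273, 120),
  stringsAsFactors = FALSE
)

# The first two columns define the edges; remaining columns become edge attributes
Mygraph <- graph_from_data_frame(df, directed = FALSE)

# Store the counts as edge weights (set on E(Mygraph), not on the graph object itself)
E(Mygraph)$weight <- df$freq

# Raw counts are far too large to use as line widths, so rescale them to 1..10
w <- E(Mygraph)$weight
edgeWidth <- 1 + 9 * (w - min(w)) / (max(w) - min(w))

plot(Mygraph, edge.width = edgeWidth)
```

Because `freq` is the third column, it is also available as `E(Mygraph)$freq` without the explicit assignment. The linear rescaling is one choice among several; a log or square-root scale may read better when the counts are very skewed.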