I'm looking for some help filling in the gaps in my experience with the igraph package in R. I want to create a graph of the common words used by 5 twitter accounts. My aim is to see which keywords the accounts share and to identify others that are used by one account and not the others.
I've created a wordcloud from the tweet text, but I'd appreciate any help from the community in converting that (if possible) to a graph. So far I have a TermDocumentMatrix, built with the tm package, holding the word frequencies shown in the wordcloud, and I'd like to incorporate that frequency data into the final plot as well.
I'm not sure what format my data needs to be in or where I should start (dataframe, Corpus, Matrix). Any pointers?
These are the manipulations I have done so far to clean and process the data.
The tweet text is stored in the "text" column of a dataframe called tweetsDF, and I clean it up to find the common words used. I start by removing the hashtags from the text using the qdapRegex package:
Text <- rm_hash(tweetsDF$text, clean=TRUE, trim=TRUE)
# Remove the twitter shortened urls using the qdapRegex package
TextNoShortURL <- rm_twitter_url(Text, trim = TRUE, clean = TRUE, extract = FALSE)
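# Build the corpus from the cleaned text (this step was omitted above; Corpus(VectorSource(...)) is the usual way and is assumed here)
TextCorpus <- Corpus(VectorSource(TextNoShortURL))
# Create a Term Document Matrix using the tm package, removing punctuation and common English words plus "and", "the", "for"
TextTDM <- TermDocumentMatrix(TextCorpus, control = list(removePunctuation = TRUE, stopwords = c(stopwords("english"), "the", "for", "and"), removeNumbers = FALSE))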
# Convert it to a Matrix
TextMatrix <- as.matrix(TextTDM)
# Get the frequency of the words found in the text
MainWord_freqs = sort(rowSums(TextMatrix), decreasing=TRUE)
# Convert it to a dataframe
TextDF <- data.frame(word=names(MainWord_freqs), freq=MainWord_freqs)
# You end up with a dataframe containing each word and how often it appeared in the text. The twitter handle isn't included here, but I assume I can mutate the dataframe to include it.
<PRE>
row.names word freq
1 shop shop 8765
2 food food 924
3 drink drink 8273
..
</PRE>
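To get the handle back in, one idea is to reshape the word-by-handle matrix into a long dataframe. This is only a sketch, and it assumes the corpus holds one document per account, so that the matrix columns are the handles. The handles (handleA, handleB) and counts below are made up for illustration:

```r
# Hypothetical word-by-handle count matrix, standing in for as.matrix(TextTDM)
# when each document in the corpus is all the tweets from one account.
# matrix() fills column by column, so handleA = (5, 0, 3), handleB = (2, 4, 0).
TextMatrix <- matrix(
  c(5, 0, 3,
    2, 4, 0),
  nrow = 3,
  dimnames = list(Terms = c("shop", "food", "drink"),
                  Docs  = c("handleA", "handleB"))
)

# Reshape to long format: one row per (word, handle) pair
WordsByHandle <- as.data.frame(as.table(TextMatrix),
                               responseName = "freq",
                               stringsAsFactors = FALSE)
names(WordsByHandle) <- c("word", "handle", "freq")

# Drop the words an account never used
WordsByHandle <- WordsByHandle[WordsByHandle$freq > 0, ]
```

Each remaining row then ties one handle to one word with its count, which looks close to an edge list.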
I'm not sure where to go from here to get a data source suitable for igraph that will let me associate each twitter handle (e.g. XYZ) with the main words it uses.
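For reference, this is roughly the shape I think I'm after. It is a minimal sketch using igraph's graph_from_data_frame, with a made-up edge list (the handles, words and counts are placeholders, not my real data), and it assumes no handle has the same name as a word:

```r
library(igraph)

# Made-up edge list: one row per (handle, word) pair with its count
edges <- data.frame(
  handle = c("handleA", "handleA", "handleB", "handleB"),
  word   = c("shop", "drink", "shop", "food"),
  freq   = c(5, 3, 2, 4),
  stringsAsFactors = FALSE
)

# The first two columns become the edge endpoints; extra columns (freq)
# are stored as edge attributes
g <- graph_from_data_frame(edges, directed = FALSE)

# Mark which vertices are handles so they can be styled differently
V(g)$is_handle <- V(g)$name %in% edges$handle

# Words connected to more than one handle are the shared keywords
word_degree  <- degree(g)[!V(g)$is_handle]
shared_words <- names(word_degree[word_degree > 1])

# Plot with edge width taken from the word frequency
plot(g,
     vertex.color = ifelse(V(g)$is_handle, "tomato", "lightblue"),
     edge.width   = E(g)$freq)
```

With real counts in the thousands the edge widths would need rescaling before plotting, and I don't know whether this is the idiomatic way to do it.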