I need to compute the correlation between every pair of terms in a term-document matrix. The matrix has 181,841 terms and 191,431 documents, and I need the correlation coefficient of each term with every other term.
I use a for loop with the code below to get the associations for one term at a time, then combine the results into a single data frame with rbind.
Edit 1: A small reproducible example is below.
clean_CTP
TP_ID Keywords
1 A,B,C,D
2 A,L,K,M
3 P,B,L,M
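To make the example easier to copy and run, the table above could be built as follows. This is only a sketch: it assumes Keywords is a plain character column, which is a guess at the real structure.

```r
# Recreate the toy input shown above.
# Assumption: Keywords is a character column of comma-separated terms.
clean_CTP <- data.frame(
  TP_ID    = 1:3,
  Keywords = c("A,B,C,D", "A,L,K,M", "P,B,L,M"),
  stringsAsFactors = FALSE
)
```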
library(tm)    # Corpus, TermDocumentMatrix, findFreqTerms and findAssocs come from tm
library(qdap)
text_corpus <- Corpus(VectorSource(clean_CTP[, 2]))
doc_term_mat <- TermDocumentMatrix(text_corpus)
selected_words <- findFreqTerms(doc_term_mat, lowfreq = 1)
filt_doc_term_mat = doc_term_mat[selected_words, ]
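pearson_thres = 0.5   # correlation cutoff; placeholder value, set as needed
# Empty result frame; one row per (main_term, rel_term) pair is appended below
df_pear = data.frame(main_term = character(),
                     rel_term = character(),
                     corr = numeric())
for (i in seq_along(selected_words)) {
  term_con = selected_words[i]
  # Terms whose correlation with term_con is at least pearson_thres
  ass = findAssocs(filt_doc_term_mat, term_con, pearson_thres)
  ass_df = as.data.frame(ass)
  main_term = rep(selected_words[i], nrow(ass_df))
  rel_term = as.vector(rownames(ass_df))
  corr = as.vector(ass_df[, 1])
  df_new = data.frame(main_term, rel_term, corr)
  df_pear = rbind(df_pear, df_new)
}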
However, this takes a very long time to run: around 5 minutes per term. Is there a better way to do this?
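One direction that might be worth trying is to compute all the correlations in a single sparse matrix product instead of calling findAssocs once per term. The following is a minimal sketch under some assumptions. It uses the Matrix package, and sparse_row_cor is a hypothetical helper name, not part of tm or qdap. It also builds the toy term-document matrix by hand rather than through tm, so it is not the same pipeline as above.

```r
library(Matrix)  # sparse matrices; a recommended package shipped with R

# Hypothetical helper: Pearson correlation between all rows (terms) of a
# sparse terms-by-documents matrix, without densifying the input itself.
# Terms with zero variance produce NaN entries.
sparse_row_cor <- function(m) {
  n  <- ncol(m)                      # number of documents
  mu <- Matrix::rowMeans(m)          # mean count per term
  # cov(x, y) = (sum(x * y) - n * mean(x) * mean(y)) / (n - 1)
  cv <- (as.matrix(Matrix::tcrossprod(m)) - n * outer(mu, mu)) / (n - 1)
  s  <- sqrt(diag(cv))               # standard deviation per term
  cv / outer(s, s)
}

# Toy data matching the example: 3 documents, comma-separated keywords.
keywords <- c("A,B,C,D", "A,L,K,M", "P,B,L,M")
tokens   <- strsplit(keywords, ",", fixed = TRUE)
terms    <- sort(unique(unlist(tokens)))
ti       <- match(unlist(tokens), terms)               # term (row) index
dj       <- rep(seq_along(tokens), lengths(tokens))    # document (column) index
tdm <- sparseMatrix(i = ti, j = dj, x = rep(1, length(ti)),
                    dims = c(length(terms), length(tokens)),
                    dimnames = list(terms, NULL))

cor_mat <- sparse_row_cor(tdm)

# Long format: one row per ordered pair of distinct terms at or above the cutoff.
pearson_thres <- 0.5   # assumed cutoff, for illustration only
idx <- which(cor_mat >= pearson_thres & row(cor_mat) != col(cor_mat),
             arr.ind = TRUE)
df_pear <- data.frame(main_term = terms[idx[, "row"]],
                      rel_term  = terms[idx[, "col"]],
                      corr      = cor_mat[idx])
```

A tm TermDocumentMatrix is stored as a slam simple triplet matrix, so it could presumably be converted with sparseMatrix(i = m$i, j = m$j, x = m$v, dims = dim(m), dimnames = dimnames(m)). Note that the result here is a dense terms-by-terms matrix. With 181,841 terms that would need hundreds of gigabytes, so at that scale the rows would have to be processed in blocks, or rare terms filtered out first.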