
I need to find the correlation between every pair of terms in a term-document matrix. The matrix has 181,841 terms and 191,431 documents, and I need the correlation coefficient of each term with every other term.

I use a for loop with the code below to get the associations for each term, then combine the per-term results into a single data frame with rbind.

Edit 1: A small reproducible example is below.

  clean_CTP

  TP_ID  Keywords
   1     A,B,C,D
   2     A,L,K,M
   3     P,B,L,M

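  library(qdap)
  library(tm)  # Corpus, TermDocumentMatrix, findFreqTerms and findAssocs come from tm

  # Build a term-document matrix from the Keywords column
  text_corpus <- Corpus(VectorSource(clean_CTP[, 2]))
  doc_term_mat <- TermDocumentMatrix(text_corpus)

  # Keep terms that occur at least once
  selected_words <- findFreqTerms(doc_term_mat, lowfreq = 1)
  filt_doc_term_mat <- doc_term_mat[selected_words, ]

  # pearson_thres (the minimum correlation to report) must be defined
  # before the loop; its value is not shown in this example.

  # Accumulator for the (term, related term, correlation) rows
  df_pear <- data.frame(main_term = character(),
                        rel_term = character(),
                        corr = numeric())

  for (i in seq_along(selected_words)) {
    term_con <- selected_words[i]

    # Terms whose correlation with term_con is at least pearson_thres
    ass <- findAssocs(filt_doc_term_mat, term_con, pearson_thres)
    ass_df <- as.data.frame(ass)

    main_term <- rep(term_con, nrow(ass_df))
    rel_term <- rownames(ass_df)
    corr <- ass_df[, 1]

    # Append this term's associations to the combined data frame
    df_new <- data.frame(main_term, rel_term, corr)
    df_pear <- rbind(df_pear, df_new)
  }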

However, this takes a very long time to execute: around 5 minutes per term. Is there a better way to do this?

NinjaR
  • Show the code you're using. Preferably, make a small reproducible example. Perhaps this could be done in parallel. – Roman Luštrik Mar 11 '17 at 08:26
  • @RomanLuštrik Edited the question. Thanks. – NinjaR Mar 11 '17 at 08:44
  • Great, you can further improve your question by making it easy for us [to copy/paste the code](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and know what the expected result is. – Roman Luštrik Mar 11 '17 at 10:39
  • Terms that appear in only one or two documents won't have any meaningful associations. Ignore that requirement and reduce the terms you are analyzing to something more meaningful by increasing your minimum document requirement (say at least 5 documents). This will increase speed and focus on meaningful results. – lmkirvan Mar 11 '17 at 14:55
  • The numbers mentioned in the question are after filtering, so that is already done. But it still takes a huge amount of time to execute. – NinjaR Mar 12 '17 at 04:04
  • Possible duplicate of [R: Calculate cosine distance from a term-document matrix with tm and proxy](http://stackoverflow.com/questions/29750519/r-calculate-cosine-distance-from-a-term-document-matrix-with-tm-and-proxy) – emilliman5 Mar 13 '17 at 17:28
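
The suggestions in the comments (filter by document frequency, avoid the per-term loop) could be combined roughly as follows. This is only a minimal sketch, not a tested solution for the full data set. It assumes the Matrix package is available and builds the toy term-document matrix by hand from the three keyword lists in the question instead of going through tm. The names `cor_block`, `min_docs` and `thres` are hypothetical and chosen for illustration. The idea is to compute Pearson correlations for a whole block of terms against all terms with one sparse cross product, rather than calling findAssocs once per term.

```r
library(Matrix)

# Toy term-document matrix (terms in rows, documents in columns),
# built by hand from the keyword lists in the question.
docs  <- list(c("A", "B", "C", "D"), c("A", "L", "K", "M"), c("P", "B", "L", "M"))
terms <- sort(unique(unlist(docs)))
n_entries <- length(unlist(docs))
tdm <- sparseMatrix(
  i = match(unlist(docs), terms),
  j = rep(seq_along(docs), lengths(docs)),
  x = rep(1, n_entries),
  dims = c(length(terms), length(docs)),
  dimnames = list(terms, paste0("doc", seq_along(docs)))
)

# Keep only terms that occur in at least min_docs documents
# (min_docs is a hypothetical setting; 2 is used only because the toy data is tiny).
min_docs <- 2
tdm <- tdm[rowSums(tdm > 0) >= min_docs, , drop = FALSE]

# Pearson correlation of the terms in `rows` (integer indices) against all terms.
# Terms with zero variance (present in every document) would yield NaN.
cor_block <- function(tdm, rows) {
  n   <- ncol(tdm)
  mu  <- rowMeans(tdm)                       # per-term mean
  ss  <- rowSums(tdm^2)                      # per-term sum of squares
  sds <- sqrt((ss - n * mu^2) / (n - 1))     # per-term standard deviation
  cross <- as.matrix(tcrossprod(tdm[rows, , drop = FALSE], tdm))
  out <- ((cross - n * outer(mu[rows], mu)) / (n - 1)) / outer(sds[rows], sds)
  dimnames(out) <- list(rownames(tdm)[rows], rownames(tdm))
  out
}

# On the toy data a single block covers every term.
cor_mat <- cor_block(tdm, seq_len(nrow(tdm)))

# Long format similar to df_pear in the question, excluding self-pairs.
thres <- 0.5  # hypothetical threshold
idx <- which(cor_mat >= thres & row(cor_mat) != col(cor_mat), arr.ind = TRUE)
df_pear <- data.frame(
  main_term = rownames(cor_mat)[idx[, "row"]],
  rel_term  = colnames(cor_mat)[idx[, "col"]],
  corr      = cor_mat[idx]
)
```

With 181,841 terms the full term-by-term result would have about 3.3e10 entries, so it could not be held as one dense matrix. In that setting `cor_block` would presumably be called on chunks of rows (a few hundred or thousand terms at a time), keeping only the pairs above the threshold from each chunk before moving on.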

0 Answers