I am trying to cluster a set of ideas, each identified by a reference. Each row contains one idea; the CSV looks like this:

library(tm)
setwd("/Users/Bif/Documents")
# read the data (semicolon-separated, with a header row)
data <- read.csv("ideas.csv", header = TRUE, sep = ";")
> data
 Reference                                                   idea
1 FI-000786          AIRE DE DETENTE LES BEAUX JOURS ARRIVENT etc…
2 FI-000754   Tiroirs de rangement des véhicules  les tiroirs etc…
3 FI-000740   EVITER LES PI Vaines sur sur les dossiers MOAR etc..
4 FI-000717    Glossaire de sigleset trigrammes ucf beaucoup  etc…
5 FI-000705        Transport de l'escabeau Bruit et accès de  etc…
6 FI-000669  economie de papier  C.Q.P (avis de passage avec  etc…
7 FI-000653          UTILISATION D 'UNE CAMERA D'INSPECTION  etc..
8 FI-000649  faciliter les déclarations de SD par les agents  etc…
9 FI-000639 Récup Embase téléreport sur coffret Des coffrets  etc…

I'm quite new to R. I've been trying the text-mining package tm, and I can analyze the term frequencies of the second column via a DocumentTermMatrix. The problem is that with this process I can only analyze the column as one plain text, not as separate pieces of text that I could compare afterwards to tell which references are similar to each other.

I've seen a topic about the qdap package which might get close to what I am looking for (even though I can't get the package to load, I don't know why yet), but I can't figure out how I would cluster the references (dates in the link example) together anyway.

I've been searching the web quite a lot, and I feel stuck now...

Thanks a lot.

  • You could paste all ideas per reference together, making a longer text, and use that in the tm package? – Wannes Rosiers Jul 31 '15 at 08:32
  • You should try to factor the column. See http://stackoverflow.com/questions/9251326/convert-data-frame-column-format-from-character-to-factor for details – Joe Chakra Jul 31 '15 at 08:35
  • @WannesRosiers Well, if I put it all together in one text I don't think I will be able to distinguish the references in the end (it's more or less what I did in the first place with tm). Plus I think it would be more difficult to compare the content of the ideas. – Fabien R Jul 31 '15 at 12:07
  • @JoeChakra I'm not sure I understood how it works; I'll try to dig into it more. I have two main problems. First, I'm struggling to compare how similar those ideas are. Second, even if I figure that out, I think the text-mining step will prevent me from keeping the link between reference and idea. – Fabien R Jul 31 '15 at 12:15
  • You may want to look at topic modeling (like the topicmodels package or the lda package). Those can cluster your text into topics. – lawyeR Jul 31 '15 at 12:54
  • @lawyeR It's not exactly what I'm looking for, since I don't want to find topics but simply cluster the most similar ideas (portions of text). But it turns out not to be that simple... It's a kind of k-means, except that k-means unfortunately only works on numeric data. I have found ways to get the word frequencies (with the tm package), but then I struggle to use this information to tell which portions of text are close to one another and form a group, all while keeping track of the reference. – Fabien R Aug 03 '15 at 07:08
  • @lawyeR If I can find the topics, do you think it's possible to associate the references with each topic afterwards? I doubt it is feasible. – Fabien R Aug 03 '15 at 09:36

1 Answer

Create a DocumentTermMatrix with tm, then turn that DocumentTermMatrix into a data frame with as.data.frame(as.matrix(mydtm)), then cbind() the reference column onto the new data frame. Alternatively, cbind() the converted DocumentTermMatrix (now a data frame) back onto your original data frame for further processing. Either way, each row keeps its reference next to its term counts.
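
The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the small `ideas` data frame is a made-up stand-in for ideas.csv (the reference codes and texts are hypothetical), the corpus is built with `VectorSource()` on the idea column only, and the final hierarchical clustering step (`dist()`, `hclust()`, `cutree()` with an arbitrary `k = 2`) is one possible way to group similar rows, not something prescribed here.

```r
library(tm)

# Hypothetical stand-in for ideas.csv: one reference and one idea per row.
ideas <- data.frame(
  Reference = c("FI-A", "FI-B", "FI-C", "FI-D"),
  idea = c("tiroirs de rangement des vehicules",
           "rangement des outils dans les vehicules",
           "economie de papier pour les avis de passage",
           "reduire le papier des avis de passage"),
  stringsAsFactors = FALSE
)

# One document per row, built from the idea column only.
corp <- VCorpus(VectorSource(ideas$idea))
dtm  <- DocumentTermMatrix(corp)

# Convert to a plain matrix and label each row with its reference.
m <- as.matrix(dtm)
rownames(m) <- ideas$Reference

# Data frame of term counts with the reference column bound back on.
dtm_df <- cbind(Reference = ideas$Reference, as.data.frame(m))

# Assumed follow-up: cluster rows by Euclidean distance between term counts.
hc     <- hclust(dist(m), method = "average")
groups <- cutree(hc, k = 2)  # k = 2 is an arbitrary choice for this toy data
print(groups)                # named by reference, so the link is preserved
```

Because the matrix rows are named by reference, the vector returned by `cutree()` is named the same way, so each cluster label can be read off per reference. On real data, weighting (for example `weightTfIdf`) and stop-word removal would probably matter more than in this toy case.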

knb
  • I found something wrong with my DocumentTermMatrix: it only has 34 terms, which cannot be right. `corp <- Corpus(DataframeSource(data)); dtm <- DocumentTermMatrix(corp); dim(dtm)` gives `[1] 133 34`. – Fabien R Aug 04 '15 at 14:05
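
The thread does not say what caused the low term count, so the following is only something worth checking, not a confirmed diagnosis. In recent tm releases (an assumption about the installed version; the thread predates them), `DataframeSource()` expects a data frame whose first column is named `doc_id` and whose second is named `text`. A minimal sketch with made-up data (the `ideas` and `docs` data frames are hypothetical) that passes the idea text explicitly and uses the reference as the document id:

```r
library(tm)

# Hypothetical stand-in for ideas.csv, read with text kept as character.
ideas <- data.frame(
  Reference = c("FI-A", "FI-B", "FI-C", "FI-D"),
  idea = c("tiroirs de rangement des vehicules",
           "rangement des outils dans les vehicules",
           "economie de papier pour les avis de passage",
           "reduire le papier des avis de passage"),
  stringsAsFactors = FALSE
)

# Rename the columns to the layout DataframeSource() expects in recent tm.
docs <- data.frame(doc_id = ideas$Reference,
                   text   = ideas$idea,
                   stringsAsFactors = FALSE)

corp <- VCorpus(DataframeSource(docs))
dtm  <- DocumentTermMatrix(corp)

dim(dtm)    # number of documents, number of distinct terms
Docs(dtm)   # document ids, which should match the references
```

If the term count still looks too small on the real file, inspecting `str(data)` to see whether the idea column was read as a factor rather than as character would be a reasonable next step.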