
I am trying to use DBSCAN (from scikit-learn) to cluster text documents. I use TF-IDF (`TfidfVectorizer` in sklearn) to build the feature vector for each document.

However, I have not found a way to obtain (print) the documents that are clustered by DBSCAN.

DBSCAN in sklearn provides an attribute called `labels_`, which gives the cluster label of each sample (e.g. 1, 2, 3, or -1 for noise). However, I want to get the documents that DBSCAN has clustered, not just the cluster labels.

To emphasize, I want to know which documents belong to each cluster. Could you please suggest ways to do this?

Thank you very much!

Glorian
  • Please provide a small reproducible sample data set and your desired data set. – MaxU - stand with Ukraine Jun 12 '18 at 18:57
  • Doesn't `TfidfVectorizer` create a dictionary from the text documents? – rickhg12hs Jun 12 '18 at 21:00
  • The text values within the documents are not clustered. The documents are clustered. Each document is represented by a vector (which contains the tf-idf values of the vocabulary words present in it). Those vectors are clustered. But the vectors are produced by TfidfVectorizer, not DBSCAN. So please clarify what you want to do. Do you want to see which documents belong to which cluster? Or do you want to see the vocabulary? Or do you want to see the most representative words of a single cluster? – Vivek Kumar Jun 13 '18 at 06:55
  • Hi everyone! Thanks for your comments. I have updated the question description. As for the example, I will try to work on it and update the description again. Nevertheless, I hope the updated description is already enough to clarify my question. – Glorian Jun 13 '18 at 08:15
  • For that you use the `labels_`. It's in the same order as your original docs. So if `labels = [1, -1, 1, 2, 3, 2]`, this means that the first document in your data belongs to cluster 1, the second document is noise, the third document again belongs to cluster 1, and so on. – Vivek Kumar Jun 13 '18 at 09:06
  • @VivekKumar: Thanks! That's what I am looking for. If you post your comment as an answer to this question, I will mark it as the accepted answer :) – Glorian Jun 13 '18 at 11:39

1 Answer


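Use the labels to select documents. With `X` as the TF-IDF matrix and `db` as the fitted DBSCAN object (the name `db` is assumed here):

X[db.labels_ == 1, :]

This should give all documents in cluster 1.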
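As a rough end-to-end sketch of the same idea: the toy corpus, the variable names (`docs`, `db`, `clusters`) and the `eps`, `min_samples` and `metric` values below are assumptions made for illustration, not taken from the question. Because `labels_` is in the same order as the input, the same mask or a simple `zip` also recovers the original texts rather than only the TF-IDF rows.

```python
from collections import defaultdict

from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with your own list of documents.
docs = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "stock markets fell sharply",
    "stock markets fell sharply again",
    "an unrelated sentence about gardening",
]

# Sparse TF-IDF matrix with one row per document, in the same order as docs.
X = TfidfVectorizer().fit_transform(docs)

# eps, min_samples and metric are illustrative values, not tuned recommendations.
db = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit(X)
labels = db.labels_  # labels[i] is the cluster of docs[i]; -1 means noise

# TF-IDF rows of the documents assigned to cluster 0.
X_cluster0 = X[labels == 0, :]

# Group the original texts by cluster label.
clusters = defaultdict(list)
for doc, label in zip(docs, labels):
    clusters[int(label)].append(doc)

for label, members in sorted(clusters.items()):
    print(label, members)
```

Note that sklearn numbers clusters from 0, so the first cluster found is 0 rather than 1, and the entry under key -1 holds the documents DBSCAN treated as noise.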

Has QUIT--Anony-Mousse