
I am trying to get the highest-frequency terms out of the vectors in scikit-learn. In the example below it is done for each category, but I want it for each file inside the categories.

https://github.com/scikit-learn/scikit-learn/blob/master/examples/document_classification_20newsgroups.py

    if opts.print_top10:
        print "top 10 keywords per class:"
        for i, category in enumerate(categories):
            top10 = np.argsort(clf.coef_[i])[-10:]
            print trim("%s: %s" % (
                category, " ".join(feature_names[top10])))

I want to do this for each file in the testing dataset instead of for each category. Where should I be looking?

Thanks

EDIT: s/discriminative/highest frequency/g (sorry for the confusion)

Phyo Arkar Lwin
  • Can't you just transform your test data with the same vectorizer that was used to parse the training data? The vectorizer stores the vocabulary after a call to `fit`, and `transform` uses that vocabulary to filter any data you pass in (according to the docs). – Matti Lyra Nov 10 '12 at 18:24
  • The vocabulary does not store anything about which document (or array/list index) a term came from. It is just a vocabulary; if you look into the scikit-learn source code you will see. – Phyo Arkar Lwin Nov 12 '12 at 12:52

2 Answers

4

You can use the result of transform together with get_feature_names to obtain the term counts for a given document.

X = vectorizer.transform(docs)                      # one (sparse) row per document
terms = np.array(vectorizer.get_feature_names())    # feature names aligned with the columns of X
terms_for_first_doc = zip(terms, X.toarray()[0])    # (term, count) pairs for the first document
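
To answer the per-file part of the question, the same idea can be wrapped in a loop; a minimal sketch, assuming `vectorizer` has already been fitted on the training data and `docs` holds the test documents:

import numpy as np

X = vectorizer.transform(docs)
terms = np.array(vectorizer.get_feature_names())

for i, row in enumerate(X.toarray()):
    # pair each term with its count in document i and keep the ten largest
    top10 = sorted(zip(terms, row), key=lambda tup: tup[1], reverse=True)[:10]
    print i, top10
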
Fred Foo
Andreas Mueller
  • Tested and corrected. I was about to post almost the same answer :) – Fred Foo Nov 12 '12 at 13:10
  • Does `get_feature_names` mean `vectorizer.get_feature_names()`? – Phyo Arkar Lwin Nov 12 '12 at 13:21
  • `terms = np.array(vectorizer.get_feature_names())` `first_top = zip(terms, X_test.toarray()[0])` this doesn't work yet. – Phyo Arkar Lwin Nov 12 '12 at 13:36
  • It retrieves all the available terms, argh! – Phyo Arkar Lwin Nov 12 '12 at 14:10
  • @V3ss0n: what did you expect it to retrieve? – Fred Foo Nov 12 '12 at 14:18
  • I want to retrieve the top discriminative terms for each document. For example, here is what I get from my modified code, using the `term_counts_per_doc` property of CountVectorizer: `vectorizer.test_term_counts_per_doc[0].most_common(5)`: `[(u'os', 8), (u'edu', 6), (u'comp', 6), (u'netcom', 4), (u'542b', 4)]` – Phyo Arkar Lwin Nov 12 '12 at 14:28
  • Note that my change introduces some inefficiency since it densifies the entire matrix. Densifying a single sample can be done with `X[0,:].toarray().ravel()`. – Fred Foo Nov 12 '12 at 14:37
  • @V3ss0n: those aren't *discriminative* terms, those are just terms with high frequency. Use `sorted`, `heapq.nlargest` or whatever Python trick you prefer to get the terms you want out of `terms_for_first_doc`: http://stackoverflow.com/a/13070505/166749 – Fred Foo Nov 12 '12 at 14:53
  • Output like yours can be easily achieved using something like `inds = np.argsort(X_train.toarray()[0])[-10:]` followed by `print(zip(vectorizer.get_feature_names(), X_train.toarray()[0][inds]))` – Andreas Mueller Nov 12 '12 at 15:00
  • Sorry, so it's terms with high frequency then. I am not a native English speaker, so I read the 20_newsgroups example and thought those were called the most discriminative terms. – Phyo Arkar Lwin Nov 12 '12 at 15:28
  • Finally got this: `sorted(terms_for_first_doc, key=lambda tup: tup[1], reverse=True)[:10]` – Phyo Arkar Lwin Nov 12 '12 at 20:39
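
Putting these comments together, a sketch that avoids densifying the whole matrix, assuming `X_test = vectorizer.transform(data_test.data)` and the `terms` array from the answer above, could look like:

for i in range(X_test.shape[0]):
    row = X_test[i, :].toarray().ravel()   # densify one document's row at a time
    top = np.argsort(row)[-10:][::-1]      # indices of its ten largest counts
    print i, zip(terms[top], row[top])
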
-1

It seems nobody knows. I am answering here since other people face the same problem; I have found where to look now, but have not fully implemented it yet.

It lies deep inside CountVectorizer, in sklearn.feature_extraction.text:

def transform(self, raw_documents):
    """Extract token counts out of raw text documents using the vocabulary
    fitted with fit or the one provided in the constructor.

    Parameters
    ----------
    raw_documents: iterable
        an iterable which yields either str, unicode or file objects

    Returns
    -------
    vectors: sparse matrix, [n_samples, n_features]
    """
    if not hasattr(self, 'vocabulary_') or len(self.vocabulary_) == 0:
        raise ValueError("Vocabulary wasn't fitted or is empty!")

    # raw_documents can be an iterable so we don't know its size in
    # advance

    # XXX @larsmans tried to parallelize the following loop with joblib.
    # The result was some 20% slower than the serial version.
    analyze = self.build_analyzer()
    term_counts_per_doc = [Counter(analyze(doc)) for doc in raw_documents]
    self.test_term_counts_per_doc = deepcopy(term_counts_per_doc)  # <<-- added here (needs deepcopy from the copy module)
    return self._term_count_dicts_to_matrix(term_counts_per_doc)

I have added `self.test_term_counts_per_doc = deepcopy(term_counts_per_doc)`, which makes the per-document counts accessible from the vectorizer outside, like this:

load_files = recursive_load_files
trainer_path = os.path.realpath(trainer_path)
tester_path = os.path.realpath(tester_path)
data_train = load_files(trainer_path, load_content = True, shuffle = False)


data_test = load_files(tester_path, load_content = True, shuffle = False)
print 'data loaded'

categories = None    # for case categories == None

print "%d documents (training set)" % len(data_train.data)
print "%d documents (testing set)" % len(data_test.data)
#print "%d categories" % len(categories)
print

# split a training set and a test set

print "Extracting features from the training dataset using a sparse vectorizer"
t0 = time()
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.7,
                             stop_words='english',charset_error="ignore")

X_train = vectorizer.fit_transform(data_train.data)


print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_train.shape
print

print "Extracting features from the test dataset using the same vectorizer"
t0 = time()
X_test = vectorizer.transform(data_test.data)
print "Test printing terms per document"
for counter in vectorizer.test_term_counts_per_doc:
    print counter
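
Since each entry in test_term_counts_per_doc is a collections.Counter, the highest-frequency terms per test document can then be read off directly with most_common; a small sketch on top of the patched vectorizer above:

for i, counts in enumerate(vectorizer.test_term_counts_per_doc):
    # e.g. [(u'os', 8), (u'edu', 6), (u'comp', 6), (u'netcom', 4), (u'542b', 4)]
    print i, counts.most_common(5)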

Here is my fork; I also submitted pull requests:

https://github.com/v3ss0n/scikit-learn

Please suggest if there is a better way to do this.

Phyo Arkar Lwin