
I am handling one hundred thousand (100,000) documents (the mean document length is about 500 terms). For each document, I want to get the top k (e.g. k = 5) most similar documents by cosine similarity. How can I do this efficiently in Python?

Here is what I did:

  1. for each document, do text segmentation, remove stop words, and count term frequencies (tf)
  2. this gives a tf matrix of about 100,000 docs * 600,000 terms
  3. do 1 - pairwise_distances(tf_matrix, metric="cosine")
  4. for each document, get the top k similar documents (a rough sketch of steps 2-4 follows this list)
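
Here is a rough sketch of steps 2-4 as described, using scikit-learn; `CountVectorizer` and its built-in English stop-word list just stand in for my custom segmentation and cleaning, and `docs` is a placeholder corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances
import numpy as np

docs = ["first document ...", "second document ..."]   # placeholder corpus

# CountVectorizer stands in for my segmentation + stop-word removal
vectorizer = CountVectorizer(stop_words='english')
tf_matrix = vectorizer.fit_transform(docs)              # sparse n_docs x n_terms matrix

# step 3: this materialises a dense n_docs x n_docs matrix, which is the expensive part
similarities = 1 - pairwise_distances(tf_matrix, metric='cosine')

# step 4: sort each row descending, skip the document itself, keep the next k
k = 5
top_k = np.argsort(-similarities, axis=1)[:, 1:k + 1]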

I ran my code on an i5 at 2.5 GHz; 12 hours have passed and it is still running. So I want to know how to optimize my code or my procedure.

Here is my thought:

  1. for each document, do feature selection and keep only terms whose tf > 1
  2. do clustering first, then compute cosine similarity within each cluster
  3. since I just need the top k similar documents, do I need to compute all pairwise cosine similarities? (see the sketch after this list)
  4. Python GPU programming or parallel programming?
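
For idea 3, something like the following rough sketch might work: score one chunk of documents at a time against the whole row-normalised sparse matrix and keep only the k best per row, so the full 100,000 x 100,000 matrix is never materialised. `tf_matrix` is assumed to be the sparse matrix from the sketch above.

import numpy as np
from sklearn.preprocessing import normalize

k = 5
X = normalize(tf_matrix)              # L2-normalise rows, so X.dot(X.T) gives cosine similarities
n_docs = X.shape[0]
top_k = np.zeros((n_docs, k), dtype=int)

chunk = 1000                          # each block is only chunk x n_docs in memory
for start in range(0, n_docs, chunk):
    end = min(start + chunk, n_docs)
    sims = X[start:end].dot(X.T).toarray()                      # dense similarity block
    sims[np.arange(end - start), np.arange(start, end)] = -1.0  # mask self-similarity
    # argpartition picks the k largest per row without a full sort (unordered within the k)
    top_k[start:end] = np.argpartition(sims, -k, axis=1)[:, -k:]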

So, do you have any good ideas?

Many thanks.


I know there is a similar question, but that's not what I want.


UPDATE1

Thanks to @orange: after profiling, I found that step 2 was the bottleneck! Here is the sample code:

import pandas as pd

def construct_dt_matrix():
    dt_matrix = pd.DataFrame(columns=['docid'])
    docid = 0
    for f in files:
        # text segmentation for f
        # remove stop words
        # word counts stored in cleaned_dict = {'word': tf}
        dt_matrix.loc[docid] = [0] * dt_matrix.shape[1]  # add one row, init all 0
        dt_matrix.set_value(docid, 'docid', docid)
        for key, value in cleaned_dict.items():
            if key not in dt_matrix.columns.values:
                dt_matrix[key] = 0  # add one column, init all 0
            dt_matrix.set_value(docid, key, value)  # bottleneck
        docid += 1

So, the bottleneck is adding new rows and columns to the pandas DataFrame one cell at a time. Any ideas?

user1024
  • Have you tried it on a smaller dataset and perhaps used a profiler to find and optimise the hotspot in your code? Have a look at RunSnakeRun. – orange Dec 24 '15 at 03:48
  • @orange Thanks to your advice, I found the bottleneck and have updated the description. Any ideas? – user1024 Dec 24 '15 at 05:52
  • `self.dt_matrix.set_value(docid, key, value)` looks like a bug. This sets the same value over and over again (to index `docid` which gets incremented after `cleaned_dict` was iterated over and column `key`). – orange Dec 24 '15 at 06:19
  • Perhaps read some tutorials on Pandas. Your understanding of it may not be accurate (many of them explain how it works and why it's fast which I think is required). – orange Dec 24 '15 at 06:25
  • Sorry, the code was extracted from a class; I have deleted `self`. The loop is correct: I first add a new row filled with all 0, and then for each key I fill the `key` column with `value`. Maybe it is inefficient to add rows and columns like this. Anyway, thanks. – user1024 Dec 24 '15 at 07:07
  • You may find it faster to use [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) than building your own matrices. – chthonicdaemon Dec 26 '16 at 15:40
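
Following up on the `CountVectorizer` suggestion in the last comment, a minimal sketch of what that could look like, assuming `files` is the list of document file paths and the built-in English stop-word list stands in for the custom cleaning:

from sklearn.feature_extraction.text import CountVectorizer

# builds the sparse doc-term matrix directly from the files, replacing
# construct_dt_matrix; the built-in tokeniser and stop-word list are
# stand-ins for the custom segmentation/cleaning step
vectorizer = CountVectorizer(input='filename', stop_words='english')
tf_matrix = vectorizer.fit_transform(files)   # sparse n_docs x n_terms matrix
vocabulary = vectorizer.vocabulary_           # maps each term to its column index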

1 Answer


Pandas DataFrames (and the underlying NumPy arrays) are only really fast if you assign whole arrays of data at once; `set_value` requires one call for each cell in your matrix. You can do `dt_matrix = pd.DataFrame(cleaned_dict)` and get a DataFrame with a single function call (ignoring Pandas-internal calls).

Try instead:

dt_matrix = pd.DataFrame()

for docid, f in enumerate(files):
    # ... text segmentation / cleaning of f, producing cleaned_dict = {'word': tf} ...
    dt_matrix_file = pd.DataFrame(cleaned_dict, index=[docid])  # one row per document
    dt_matrix_file['docid'] = docid
    dt_matrix = dt_matrix.append(dt_matrix_file)

This should be orders of magnitude faster.

If you require the NaN cells to be zero, you can call dt_matrix.fillna(0) (again, one call instead of potentially n * m).
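
Going a step further (as the follow-up comment below also suggests), you could collect all the per-document word-count dicts first and build the whole frame in one constructor call; a rough sketch, assuming `cleaned_dicts` is a list with one `{'word': tf}` dict per document, in docid order:

import pandas as pd

# one constructor call builds the whole docs x terms matrix; terms missing
# from a document become NaN and are then filled with 0
dt_matrix = pd.DataFrame(cleaned_dicts).fillna(0)
dt_matrix.insert(0, 'docid', list(range(len(cleaned_dicts))))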

orange
  • Thanks first. I tried `DataFrame.append()`; it is really faster than `set_value`, but still not that fast. Inspired by you: can we first get all the column names and then just add new rows to the DataFrame? Maybe `append` needs a `join`, so it still takes some time. – user1024 Dec 24 '15 at 09:19
  • It's not just `append` that makes it faster, but also the creation of the DataFrame and avoiding iterating over the dict. – orange Dec 24 '15 at 12:13