How to predict a continuous value (time) from text documents?

Question

I have about 3000 text documents, each associated with the length of time for which the document was "interesting". So let's say document 1 has 300 lines of text, whose content led to a duration of interest of 5.5 days, whereas another document with 40 lines of text was "interesting" for 6.7 days, and so on.

Now the task is to predict the duration of interest (which is a continuous value) based on the text content.

I have two ideas to approach the problem:

1. Build a model of similar documents with a technology like http://radimrehurek.com/gensim/simserver.html. When a new document arrives one could try to find the 10 most similar documents in the past and simply compute the average of their duration and take that value as prediction for the duration of interest for the new document.
2. Put the documents into categories of duration (e.g. 1 day, 2 days, 3-5 days, 6-10 days, ...). Then train a classifier to predict the category of duration based on the text content.

The advantage of idea #1 is that I could also calculate the standard deviation of my prediction, whereas with idea #2 it is less clear to me how I could compute a similar measure of uncertainty. It is also unclear to me which categories to choose to get the best results from a classifier.

So is there a rule of thumb for building a system that best predicts a continuous value like time from text documents? Should one use a classifier, or an approach based on averaging the values of similar documents? I have no real experience in this area and would like to know which approach you think would probably yield the best results. Bonus points are given if you know a simple existing technology (Java or Python based) which could be used to solve this problem.

@larsmans: Why on the one hand you give an answer to this question, but on the other hand you vote for this question to be closed as off topic? — asmaier, Feb 26 '13 at 15:03

score 3 · Accepted Answer · answered Feb 26 '13 at 12:59

Approach (1) is called k-nearest neighbors regression. It's perfectly valid. So are myriad other approaches to regression, e.g. plain multiple regression using the documents' tokens as features.

Here's a skeleton script to fit a linear regression model using scikit-learn(*):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDRegressor

# build a term-document matrix with tf-idf weights for the terms
vect = TfidfVectorizer(input="filename")
Xtrain = vect.fit_transform(documents)         # documents: list of filenames

# now set ytrain to a list of durations, such that ytrain[i] is the duration
# of documents[i]
ytrain = ...

# train a linear regression model using stochastic gradient descent (SGD)
regr = SGDRegressor()
regr.fit(Xtrain, ytrain)

That's it. If you now have new documents for which you want to predict the duration of interest, do

Xtest = vect.transform(new_documents)
ytest = regr.predict(Xtest)

This is a simple linear regression. In reality, I would expect interest duration to not be a linear function of a text's contents, but this might get you started. The next step would be to pick up any textbook on machine learning or statistics that treats more advanced regression models.
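For comparison, approach (1) from the question, k-nearest neighbors regression, can be sketched in the same style. This is a minimal sketch, not a tuned model: the toy documents and durations are invented for illustration, raw strings are passed instead of filenames, and cosine distance with 2 neighbors is an assumption rather than a recommendation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor

# Toy data, invented for illustration: document texts and durations in days.
train_docs = [
    "server outage reported in the data center",
    "data center outage resolved after two days",
    "new phone model released with better camera",
    "camera review of the new phone model",
    "election results announced late at night",
    "late election results cause long debate",
]
train_days = np.array([5.5, 6.7, 2.0, 2.5, 9.0, 8.0])

# tf-idf features, here built from raw strings rather than filenames
vect = TfidfVectorizer()
X_train = vect.fit_transform(train_docs)

# predict the unweighted mean duration of the k most similar documents;
# brute-force search is used because the tf-idf matrix is sparse
knn = KNeighborsRegressor(n_neighbors=2, metric="cosine", algorithm="brute")
knn.fit(X_train, train_days)

new_docs = ["outage in the data center"]
X_new = vect.transform(new_docs)
predicted_days = knn.predict(X_new)

# spread of the neighbors' durations as a rough uncertainty measure
_, neighbor_idx = knn.kneighbors(X_new)
spread_days = train_days[neighbor_idx].std(axis=1)
```

The standard deviation over the neighbors' durations is the uncertainty measure the question asks about; with so few neighbors it is only a crude indication.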

(*) I'm a contributor to this project, so this is not unbiased advice. Just about any half-decent machine learning toolkit has linear regression models.

Thank you for giving approach (1) a name: k-nearest neighbors regression. That helps me a lot. — asmaier, Feb 26 '13 at 16:17

score 1 · Answer 2 · edited May 23 '17 at 12:23

(The following is based on my academic "experience", but seems informative enough to post.)

It looks like your task can be reformulated as:

Given a training set of scored documents, design a system for scoring arbitrary documents based on their content.

"based on their content" is very ambiguous. In fact, I'd say it's too ambiguous. You could try to find a specific feature of those documents which seems to be responsible for the score. It's more of a human task until you can narrow it down, e.g. you know you're looking for certain "valuable" words which make up the score, or maybe groups of words (have a look at http://en.wikipedia.org/wiki/N-gram).

You might also try developing a search-engine-like system, based on a similarity measure, sim(doc1, doc2). However, you'd need a large corpus featuring all possible scores (from the lowest to the highest, multiple times), so that for every input document, similar documents would have a chance to exist. Otherwise, the results would be inconclusive.

Depending on what values sim() would return, the measure should fulfill a relationship like:

sim(doc1,doc2) == 1.0 - |score(doc1) - score(doc2)|.

To test the quality of the measure, you could compute the similarity and score difference for each pair of documents, and check the correlation.

The first pick would be the cosine similarity using tf-idf.
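The pairwise check described above, with tf-idf cosine similarity as the measure, might look roughly like this. It is a sketch under assumptions: the documents and scores are made up, and a plain Pearson correlation is used as the quality indicator. If the measure is useful, the correlation between similarity and score difference should be clearly negative.

```python
from itertools import combinations

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data, invented for illustration: documents and their known scores.
docs = [
    "server outage reported in the data center",
    "data center outage resolved after two days",
    "new phone model released with better camera",
    "camera review of the new phone model",
    "election results announced late at night",
    "late election results cause long debate",
]
scores = np.array([5.5, 6.7, 2.0, 2.5, 9.0, 8.0])

# pairwise cosine similarity between the tf-idf vectors, as an (n, n) array
tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)

# collect similarity and absolute score difference for every unordered pair
pairs = list(combinations(range(len(docs)), 2))
pair_sim = np.array([sim[i, j] for i, j in pairs])
pair_diff = np.array([abs(scores[i] - scores[j]) for i, j in pairs])

# Pearson correlation between similarity and score difference
corr = np.corrcoef(pair_sim, pair_diff)[0, 1]
print(f"correlation between similarity and score difference: {corr:.2f}")
```

A correlation near zero would suggest that the similarity measure carries little information about the score, whatever is done with it afterwards.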

You've also mentioned categorizing the data. It seems to me like a method "justifying" a poor similarity measure. I.e. if the measure is good, it should be clear which category the document would fall into. As for classifiers, your documents should first have some "features" defined.

If you had a large corpus of the documents, you could try clustering to speed up the process.

Lastly, to determine the final score, I would suggest combining the scores of the few most similar documents. A raw average might not be the best idea in this case, because "less similar" would also mean "less accurate".
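One way to account for that is a similarity-weighted average over the top k documents. The helper below is a hypothetical illustration, not an existing library function; it assumes non-negative similarities (as tf-idf cosine similarity gives) and falls back to a plain average when all weights are zero.

```python
def weighted_score(similarities, scores, k=3):
    """Similarity-weighted average of the scores of the k most similar documents.

    Hypothetical helper: similarities[i] is the (non-negative) similarity of
    the new document to training document i, scores[i] is that document's score.
    """
    # keep the k (similarity, score) pairs with the highest similarity
    ranked = sorted(zip(similarities, scores), key=lambda p: p[0], reverse=True)[:k]
    total = sum(s for s, _ in ranked)
    if total == 0:
        # no neighbor is similar at all: fall back to an unweighted average
        return sum(score for _, score in ranked) / len(ranked)
    return sum(s * score for s, score in ranked) / total


# Made-up similarities and scores for four training documents.
sims = [0.9, 0.1, 0.5, 0.0]
doc_scores = [6.0, 2.0, 4.0, 9.0]
estimate = weighted_score(sims, doc_scores, k=3)
```

In this example the three nearest documents have scores 6.0, 4.0 and 2.0; the weighting pulls the estimate toward 6.0, the score of the most similar document, instead of the raw average of 4.0.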

As for implementation, have a look at: Simple implementation of N-Gram, tf-idf and Cosine similarity in Python.

(IMHO, 3000 documents is far too few to do anything reliable without further knowledge of their content or of the relationship between content and score.)
