I have about 3000 text documents, each associated with the length of time for which the document was "interesting". So let's say document 1 has 300 lines of text, and its content led to a duration of interest of 5.5 days, whereas another document with 40 lines of text was "interesting" for 6.7 days, and so on.
Now the task is to predict the duration of interest (which is a continuous value) based on the text content.
I have two ideas to approach the problem:
- Build a model of similar documents with a technology like http://radimrehurek.com/gensim/simserver.html. When a new document arrives, one could find the 10 most similar past documents, compute the average of their durations, and take that value as the prediction for the new document's duration of interest.
- Put the documents into categories of duration (e.g. 1 day, 2 days, 3-5 days, 6-10 days, ...). Then train a classifier to predict the category of duration based on the text content.
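To make idea #1 concrete, a minimal pure-Python sketch could look like the following. It is only an illustration under stated assumptions: the whitespace tokenizer, the smoothed TF-IDF weighting, cosine similarity, and all function names (`tokenize`, `build_idf`, `predict_duration`, ...) are hypothetical choices and not the gensim simserver API. The toy texts and durations are made up.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return text.lower().split()


def build_idf(token_lists):
    # Document frequency: in how many documents does each term occur?
    n_docs = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    # Smoothed IDF, so terms that occur in every document keep a small weight.
    return {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse vector as a dict; terms unseen in the training set are ignored.
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_duration(new_text, train_texts, train_durations, k=10):
    """Return (mean, population std) of the durations of the k most similar documents."""
    token_lists = [tokenize(t) for t in train_texts]
    idf = build_idf(token_lists)
    vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]
    query = tfidf_vector(tokenize(new_text), idf)
    # Pair each similarity with its duration and sort by similarity, highest first.
    scored = sorted(
        zip((cosine(query, v) for v in vectors), train_durations),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbours = [duration for _, duration in scored[:k]]
    return mean(neighbours), pstdev(neighbours)


# Made-up toy data: durations are in days.
train_texts = [
    "election results vote count",
    "election vote recount dispute",
    "football match score goal",
    "football goal penalty match",
]
train_durations = [6.0, 7.0, 2.0, 3.0]

prediction, spread = predict_duration(
    "vote recount in the election", train_texts, train_durations, k=2
)
print(prediction, spread)
```

With 3000 documents this brute-force comparison should still be fast enough to experiment with, and the second return value is the standard deviation mentioned as the advantage of this idea.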
The advantage of idea #1 is that I could also calculate the standard deviation of my prediction, whereas with idea #2 it is less clear to me how I could compute a similar measure of uncertainty. It is also unclear to me which categories to choose to get the best results from a classifier.
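On the category question for idea #2, one option I can think of (an assumption on my part, not something I know to be best) is to derive the category boundaries from the data, so that each category holds roughly the same number of documents. A minimal sketch using only the Python standard library follows; the helper names `quantile_bin_edges` and `assign_bin` and the example durations are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_bin_edges(durations, n_bins=4):
    # quantiles() returns the n_bins - 1 interior cut points.
    return quantiles(durations, n=n_bins, method="inclusive")


def assign_bin(duration, edges):
    # Index of the category: 0 for the shortest durations, len(edges) for the longest.
    return bisect_right(edges, duration)


# Made-up durations in days.
durations = [1.0, 1.5, 2.0, 3.0, 4.5, 5.5, 6.7, 9.0]
edges = quantile_bin_edges(durations, n_bins=4)
labels = [assign_bin(d, edges) for d in durations]
print(edges)
print(labels)
```

The resulting labels could then serve as the classifier targets, though the number of bins would still have to be chosen by experiment.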
So is there a rule of thumb for building a system that best predicts a continuous value like time from text documents? Should one use a classifier, or an approach based on averaging the values of similar documents? I have no real experience in this area and would like to know which approach you think would probably yield the best results. Bonus points are given if you know a simple existing technology (Java or Python based) which could be used to solve this problem.