Given your problem description, the characteristics of your data, and your ML background and personal preferences, I would recommend Orange.
Orange is a mature, free and open source project with a large selection of ML algorithms and excellent documentation and training materials. Most users probably use the GUI supplied with Orange, but the framework is scriptable with Python.
Using this framework will therefore enable you to quickly experiment with a variety of classifiers, because (i) they are all in one place; and (ii) each is accessed through a common configuration GUI. All of the ML techniques within the Orange framework can be run in "demo" mode against one or more sample data sets supplied with the Orange install. The documentation supplied with the Orange install is excellent. In addition, the home page includes links to numerous tutorials that cover probably every ML technique included in the framework.
Given your problem, perhaps begin with a Decision Tree algorithm (either a C4.5 or ID3 implementation). A fairly recent edition of Dr. Dobb's Journal (online) includes an excellent article on using decision trees; the use case is web-server data (from the server access log).
Orange has a C4.5 implementation, available from the GUI (as a "widget"). If that's too easy, about 100 lines is all it takes to code one in Python. Here's the source for a working implementation in that language.
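To give a sense of the scale involved, here is a minimal sketch of the ID3 idea (information-gain splits on categorical features) in plain Python. This is a toy illustration I wrote for this answer, not the linked implementation, and not full C4.5 (no pruning, no continuous-feature handling):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_feature(rows, labels, features):
    """Return the feature index whose split yields the highest information gain."""
    base = entropy(labels)
    def gain(f):
        remainder = 0.0
        for value in set(row[f] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[f] == value]
            remainder += (len(subset) / len(labels)) * entropy(subset)
        return base - remainder
    return max(features, key=gain)

def id3(rows, labels, features):
    """Build a decision tree as nested dicts; leaves are class labels."""
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not features:                     # no features left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(rows, labels, features)
    tree = {"feature": f, "branches": {}}
    for value in set(row[f] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[f] == value]
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        tree["branches"][value] = id3(sub_rows, sub_labels,
                                      [g for g in features if g != f])
    return tree

def classify(tree, row):
    """Walk the tree from the root to a leaf for one data point."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["feature"]]]
    return tree

# toy data: feature 0 = weather, feature 1 = temperature
rows = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
tree = id3(rows, labels, [0, 1])
print(classify(tree, ("rain", "hot")))   # prints "yes"
```

Note that `classify` will raise a `KeyError` on a feature value never seen in training; a real implementation would fall back to a majority vote at that node.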
I recommend starting with a Decision Tree for several reasons:

1. If it works on your data, you will not only have a trained classifier, but you will also have a visual representation of the entire classification schema (represented as a binary tree). Decision Trees are (probably) unique among ML techniques in this respect.

2. The characteristics of your data are aligned with the optimal performance scenario of C4.5; the data can be either categorical or continuous variables (though this technique performs better when more of the features (columns/fields) are discrete rather than continuous, which seems to describe your data); also, Decision Tree algorithms can accept, without any pre-processing, incomplete data points.

3. Simple data pre-processing. The data fed to a decision tree algorithm does not require as much pre-processing as most other ML techniques; pre-processing is often (usually?) the most time-consuming task in the entire ML workflow. It's also sparsely documented, so it's probably also the most likely source of error.
You can deduce the (relative) weight of each variable from each node's distance from the root--in other words, from a quick visual inspection of the trained classifier. Recall that the trained classifier is just a binary tree (and is often rendered this way) in which each node corresponds to one value of one feature (a variable, or column in your data set); the two edges joined to that node of course represent the data points split into two groups based on each point's value for that feature (e.g., if the feature is the categorical variable "Publication Date in HTML Page Head?", then through the left edge will flow all data points in which the publication date is not within the opening and closing head tags, and through the right edge flows the other group). What is the significance of this? Since a node just represents a state or value for a particular variable, that variable's importance (or weight) in classifying the data can be deduced from its position in the tree--i.e., the closer it is to the root node, the more important it is.
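To make the "closer to the root, more important" heuristic concrete, here is a small sketch that walks a trained tree and records the shallowest depth at which each feature appears. The nested-dict tree format and the feature names are hypothetical, purely for illustration (this is not Orange's native representation):

```python
# a toy trained tree: each internal node tests one feature, leaves are class
# labels (a hypothetical nested-dict format, not Orange's native representation)
toy_tree = {
    "feature": "pub_date_in_head",
    "branches": {
        "yes": "recent",
        "no": {
            "feature": "dist_from_body_close",
            "branches": {"near": "recent", "far": "old"},
        },
    },
}

def feature_depths(tree, depth=0, out=None):
    """Record the shallowest depth at which each feature appears in the tree."""
    if out is None:
        out = {}
    if isinstance(tree, dict):
        f = tree["feature"]
        out[f] = min(depth, out.get(f, depth))
        for subtree in tree["branches"].values():
            feature_depths(subtree, depth + 1, out)
    return out

# smaller depth => closer to the root => (roughly) greater weight
depths = feature_depths(toy_tree)
```

Here `pub_date_in_head` sits at depth 0 and `dist_from_body_close` at depth 1, mirroring the root-distance reading described above.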
From your Question, it seems you have two tasks to complete before you can feed your training data to an ML classifier.
I. identify plausible class labels
What you want to predict is a date. Unless your resolution requirements are unusually strict (e.g., resolved to a single date), I would build a classification model (which returns a class label given a data point) rather than a regression model (which returns a single continuous value).
Given that your response variable is a date, a straightforward approach is to set the earliest date as the baseline, 0, then represent every other date as an integer distance from that baseline. Next, discretize all dates into a small number of ranges. One very simple technique for doing this is to calculate the five-number summary for your response variable (min, 1st quartile, median, 3rd quartile, and max). From these five statistics, you get four sensibly chosen date ranges (though probably not of equal span or of equal membership size).
These four ranges of date values then represent your class labels--so, for instance, classI might be all data points (web pages, I suppose) whose response variable (publication date) is 0 to 10 days after the baseline; classII is 11 to 25 days after the baseline, etc.
[Note: added the code below in light of the OP's comment below this answer, requesting clarification.]
# suppose these are publication dates
>>> pd0 = "04-09-2011"
>>> pd1 = "17-05-2010"
# convert them to python datetime instances, e.g.:
>>> from datetime import datetime
>>> pd0 = datetime.strptime(pd0, "%d-%m-%Y")
>>> pd1 = datetime.strptime(pd1, "%d-%m-%Y")
# gather them in a python list and then call sort on that list:
>>> pd_all = [pd0, pd1, pd2, pd3, ...]
>>> pd_all.sort()
# 'sort' performs an in-place sort on the list of datetime objects,
# such that the earliest date is at index 0, etc.
# now the first item in that list is of course the earliest publication date
>>> pd_all[0]
datetime.datetime(2010, 5, 17, 0, 0)
# express all dates except the earliest one as the difference in days
# from that earliest date
>>> td0 = pd_all[1] - pd_all[0]    # td0 is a timedelta object
>>> td0
datetime.timedelta(475)
# convert the time deltas to integers via the timedelta 'days' attribute:
>>> fnx = lambda v: v.days
>>> time_deltas = [td0, ...]
# d is just a python list of integers representing number of days from a
# common baseline date
>>> d = list(map(fnx, time_deltas))
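Building on those day-offsets, the quartile-based binning into four class labels can be sketched like so (assuming Python 3.8+ for `statistics.quantiles`; the class-label names are just placeholders):

```python
import statistics
from bisect import bisect_right

def date_classes(day_offsets):
    """Bin integer day-offsets into four classes split at the quartiles."""
    cuts = statistics.quantiles(day_offsets, n=4)    # [Q1, median, Q3]
    labels = ["classI", "classII", "classIII", "classIV"]
    return [labels[bisect_right(cuts, d)] for d in day_offsets]

# e.g., day-offsets computed as in the snippet above
offsets = [0, 3, 10, 25, 40, 100, 250, 475]
classes = date_classes(offsets)
```

Each offset lands in one of four ranges cut at the quartiles, giving roughly equal class membership (span widths will differ).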
II. convert your raw data to an "ML-usable" form
For a C4.5 classifier, this task is far simpler and requires fewer steps than for probably every other ML algorithm. What's preferred here is to discretize as many of your parameters as possible into a relatively small number of values--e.g., if one of your parameters/variables is "distance of the publication date string from the closing body tag", then I would suggest discretizing those values into ranges, much as marketing surveys ask participants to report their age in one of a specified set of spans (18-35, 36-50, etc.) rather than as a single integer (41).
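As a sketch of that kind of discretization (the cut points and span labels below are hypothetical, chosen only for illustration):

```python
from bisect import bisect_left

# hypothetical cut points (in characters) for the feature
# "distance of the publication date string from the closing body tag"
CUTS = [100, 500, 2000]
SPANS = ["0-100", "101-500", "501-2000", ">2000"]

def discretize(distance):
    """Map a raw integer distance onto one of a few labeled spans."""
    return SPANS[bisect_left(CUTS, distance)]
```

So `discretize(750)` falls in the "501-2000" span; every raw value collapses into one of just four feature values the tree can split on.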