Given your problem description, the characteristics of your data, and your ML background and personal preferences, I would recommend Orange.
Orange is a mature, free and open source project with a large selection of ML algorithms and excellent documentation and training materials. Most users probably use the GUI supplied with Orange, but the framework is scriptable with Python.
Using this framework will therefore enable you to quickly experiment with a variety of classifiers, because (i) they are all in one place; and (ii) each is accessed through a common configuration GUI. All of the ML techniques within the Orange framework can be run in "demo" mode against one or more sample data sets supplied with the Orange install. The documentation supplied with the Orange install is excellent. In addition, the home page includes links to numerous tutorials that cover probably every ML technique included in the framework.
Given your problem, perhaps begin with a Decision Tree algorithm (either a C4.5 or ID3 implementation). A fairly recent edition of Dr. Dobb's Journal (online) includes an excellent article on using decision trees; the use case is web-server data (from the server access log).
Orange has a C4.5 implementation, available from the GUI (as a "widget"). If that's too easy, about 100 lines is all it takes to code one in Python. Here's the source for a working implementation in that language.
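To give a sense of the scale involved, here is a minimal sketch of the ID3 idea (information-gain splits on categorical features) in plain Python. This is a toy illustration I wrote for this answer, not the linked implementation, and not full C4.5 (no pruning, no continuous-feature handling):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_feature(rows, labels, features):
    """Return the feature index whose split yields the highest information gain."""
    base = entropy(labels)
    def gain(f):
        remainder = 0.0
        for value in set(row[f] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[f] == value]
            remainder += (len(subset) / len(labels)) * entropy(subset)
        return base - remainder
    return max(features, key=gain)

def id3(rows, labels, features):
    """Build a decision tree as nested dicts; leaves are class labels."""
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not features:                     # no features left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(rows, labels, features)
    tree = {"feature": f, "branches": {}}
    for value in set(row[f] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[f] == value]
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        tree["branches"][value] = id3(sub_rows, sub_labels,
                                      [g for g in features if g != f])
    return tree

def classify(tree, row):
    """Walk the tree from the root to a leaf for one data point."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["feature"]]]
    return tree

# toy data: feature 0 = weather, feature 1 = temperature
rows = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
tree = id3(rows, labels, [0, 1])
print(classify(tree, ("rain", "hot")))   # prints "yes"
```

Note that `classify` will raise a `KeyError` on a feature value never seen in training; a real implementation would fall back to a majority vote at that node.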
I recommend starting with a Decision Tree for several reasons:

1. If it works on your data, you will not only have a trained classifier, but you will also have a visual representation of the entire classification schema (represented as a binary tree). Decision Trees are (probably) unique among ML techniques in this respect.

2. The characteristics of your data are aligned with the optimal performance scenario of C4.5; the data can be either categorical or continuous variables (though this technique performs better when more of the features (columns/fields) are discrete rather than continuous, which seems to describe your data); also, Decision Tree algorithms can accept, without any pre-processing, incomplete data points.

3. Simple data pre-processing. The data fed to a decision tree algorithm does not require as much pre-processing as most other ML techniques; pre-processing is often (usually?) the most time-consuming task in the entire ML workflow. It's also sparsely documented, so it's probably also the most likely source of error.
You can deduce the (relative) weight of each variable from each node's distance from the root--in other words, from a quick visual inspection of the trained classifier. Recall that the trained classifier is just a binary tree (and is often rendered this way) in which each node corresponds to one value of one feature (a variable, or column in your data set); the two edges joined to that node of course represent the data points split into two groups based on each point's value for that feature (e.g., if the feature is the categorical variable "Publication Date in HTML Page Head?", then through the left edge will flow all data points in which the publication date is not within the opening and closing head tags, and through the right edge flows the other group). What is the significance of this? Since a node just represents a state or value for a particular variable, that variable's importance (or weight) in classifying the data can be deduced from its position in the tree--i.e., the closer it is to the root node, the more important it is.
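To make the "closer to the root, more important" heuristic concrete, here is a small sketch that walks a trained tree and records the shallowest depth at which each feature appears. The nested-dict tree format and the feature names are hypothetical, purely for illustration (this is not Orange's native representation):

```python
# a toy trained tree: each internal node tests one feature, leaves are class
# labels (a hypothetical nested-dict format, not Orange's native representation)
toy_tree = {
    "feature": "pub_date_in_head",
    "branches": {
        "yes": "recent",
        "no": {
            "feature": "dist_from_body_close",
            "branches": {"near": "recent", "far": "old"},
        },
    },
}

def feature_depths(tree, depth=0, out=None):
    """Record the shallowest depth at which each feature appears in the tree."""
    if out is None:
        out = {}
    if isinstance(tree, dict):
        f = tree["feature"]
        out[f] = min(depth, out.get(f, depth))
        for subtree in tree["branches"].values():
            feature_depths(subtree, depth + 1, out)
    return out

# smaller depth => closer to the root => (roughly) greater weight
depths = feature_depths(toy_tree)
```

Here `pub_date_in_head` sits at depth 0 and `dist_from_body_close` at depth 1, mirroring the root-distance reading described above.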
From your Question, it seems you have two tasks to complete before you can feed your training data to an ML classifier.
I. identify plausible class labels
What you want to predict is a date. Unless your resolution requirements are unusually strict (e.g., resolved to a single date), I would build a classification model (which returns a class label given a data point) rather than a regression model (which returns a single continuous value).
Given that your response variable is a date, a straightforward approach is to set the earliest date as the baseline, 0, then represent every other date as an integer distance from that baseline. Next, discretize all dates into a small number of ranges. One very simple technique for doing this is to calculate the five-number summary for your response variable (min, 1st quartile, median, 3rd quartile, and max). From these five statistics, you get four sensibly chosen date ranges (though probably not of equal span or of equal membership size).
These four ranges of date values then represent your class labels--so, for instance, classI might be all data points (web pages, I suppose) whose response variable (publication date) is 0 to 10 days after the baseline; classII is 11 to 25 days after the baseline, etc.
[Note: added the code below in light of the OP's comment below this answer, requesting clarification.]
# suppose these are publication dates
>>> pd0 = "04-09-2011"
>>> pd1 = "17-05-2010"
# convert them to python datetime instances, e.g.:
>>> from datetime import datetime
>>> pd0 = datetime.strptime(pd0, "%d-%m-%Y")
>>> pd1 = datetime.strptime(pd1, "%d-%m-%Y")
# gather them in a python list and then call sort on that list:
>>> pd_all = [pd0, pd1, pd2, pd3, ...]
>>> pd_all.sort()
# 'sort' performs an in-place sort on the list of datetime objects,
# such that the earliest date is at index 0, etc.
# now the first item in that list is of course the earliest publication date
>>> pd_all[0]
datetime.datetime(2010, 5, 17, 0, 0)
# express all dates except the earliest one as the difference in days
# from that earliest date
>>> td0 = pd_all[1] - pd_all[0]    # td0 is a timedelta object
>>> td0
datetime.timedelta(475)
# convert the time deltas to integers via the timedelta 'days' attribute:
>>> fnx = lambda v: v.days
>>> time_deltas = [td0, ...]
# d is just a python list of integers representing number of days from a
# common baseline date
>>> d = list(map(fnx, time_deltas))
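Building on those day-offsets, the quartile-based binning into four class labels can be sketched like so (assuming Python 3.8+ for `statistics.quantiles`; the class-label names are just placeholders):

```python
import statistics
from bisect import bisect_right

def date_classes(day_offsets):
    """Bin integer day-offsets into four classes split at the quartiles."""
    cuts = statistics.quantiles(day_offsets, n=4)    # [Q1, median, Q3]
    labels = ["classI", "classII", "classIII", "classIV"]
    return [labels[bisect_right(cuts, d)] for d in day_offsets]

# e.g., day-offsets computed as in the snippet above
offsets = [0, 3, 10, 25, 40, 100, 250, 475]
classes = date_classes(offsets)
```

Each offset lands in one of four ranges cut at the quartiles, giving roughly equal class membership (span widths will differ).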
II. convert your raw data to an "ML-usable" form
For a C4.5 classifier, this task is far simpler and requires fewer steps than for probably every other ML algorithm. What's preferred here is to discretize as many of your parameters as possible into a relatively small number of values--e.g., if one of your parameters/variables is "distance of the publication date string from the closing body tag", then I would suggest discretizing those values into ranges, much as marketing surveys ask participants to report their age in one of a specified set of spans (18-35, 36-50, etc.) rather than as a single integer (41).
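As a sketch of that kind of discretization (the cut points and span labels below are hypothetical, chosen only for illustration):

```python
from bisect import bisect_left

# hypothetical cut points (in characters) for the feature
# "distance of the publication date string from the closing body tag"
CUTS = [100, 500, 2000]
SPANS = ["0-100", "101-500", "501-2000", ">2000"]

def discretize(distance):
    """Map a raw integer distance onto one of a few labeled spans."""
    return SPANS[bisect_left(CUTS, distance)]
```

So `discretize(750)` falls in the "501-2000" span; every raw value collapses into one of just four feature values the tree can split on.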