Classifying URLs into categories - Machine Learning

Question

[I'm approaching this as an outsider to machine learning. It just seems like a classification problem that I should be able to solve with fairly good accuracy using machine learning.]

Training Dataset:
I have millions of URLs, each tagged with a particular category. There is a limited number of categories (50-100).

Now, given a fresh URL, I want to assign it to one of those categories. The category can be determined from the URL using conventional methods, but that would require a huge, unmanageable mess of pattern matching.

So I want to build a box where INPUT is URL, OUTPUT is Category. How do I build this box driven by ML?

As much as I would love to understand the basic fundamentals of how this would work out mathematically, right now I'm much, much more focussed on getting it done, so a conceptual understanding of the systems and processes involved is what I'm looking to get. I suppose machine learning is at a point where you can approach reasonably straightforward problems in that manner.
If you feel I'm wrong and I need to understand the foundations deeply in order to get value out of ML, do let me know.

I'm building this inside an AWS ecosystem so I'm open to using Amazon ML if it makes things quicker and simpler.

score 1 · Answer 1 · answered May 01 '17 at 15:41

I suppose machine learning is at a point where you can approach reasonably straightforward problems in that manner.

It is not. Building an effective ML solution requires an understanding of the problem's scope and constraints (in your case: new categories over time? Runtime requirements? Execution frequency? Latency requirements? Cost of errors? And more!). These constraints then determine what kinds of feature engineering and processing you consider, and what types of models you look at. Your particular problem may also involve non-I.I.D. data, which violates an assumption most ML methods make (that examples are independent and identically distributed). This would affect how you evaluate the accuracy of your model.
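To make the evaluation point concrete, here is a minimal sketch of one way non-I.I.D. data can change how you split train and test sets. It assumes (my assumption, not something stated in the question) that URLs from the same host are correlated, so whole hosts are held out rather than random rows. The helper name `split_by_host` and the example URLs are hypothetical.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def split_by_host(urls, test_fraction=0.2, seed=0):
    """Hold out entire hosts so no host appears in both train and test.

    Hypothetical helper: assumes URLs sharing a host are not independent.
    """
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).hostname or ""].append(url)

    hosts = sorted(groups)                 # deterministic order before shuffling
    random.Random(seed).shuffle(hosts)
    n_test = max(1, int(len(hosts) * test_fraction))
    test_hosts = set(hosts[:n_test])

    train = [u for h in hosts if h not in test_hosts for u in groups[h]]
    test = [u for h in hosts if h in test_hosts for u in groups[h]]
    return train, test


# Made-up URLs, purely for illustration.
urls = [
    "http://a.example.com/sports/1",
    "http://a.example.com/sports/2",
    "http://b.example.org/news/1",
    "http://b.example.org/news/2",
    "http://c.example.net/shop/1",
]
train, test = split_by_host(urls, test_fraction=0.34)
print(len(train), len(test))
```

If accuracy measured this way is much lower than with a random split, the model is probably memorising hosts rather than learning patterns that carry over to new ones.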

If you want to learn enough ML to solve this problem, you might start by looking at work on malicious URL classification; an example can be found here. While you could "hack" your way to something without learning more about ML, I would not personally trust any solution built in that manner.

score 1 · Answer 2 · edited May 23 '17 at 10:31

If you feel I'm wrong and I need to understand the foundations deeply in order to get value out of ML, do let me know.

Okay, I'll bite.

There are really two schools of thought currently related to prediction: "machine learners" versus statisticians. The former group focuses almost entirely on practical and applied prediction, using techniques like k-fold cross-validation, bagging, etc., while the latter group is focused more on statistical theory and research methods. You seem to fall into the machine-learning camp, which is fine, but then you say this:

As much as I would love to understand the basic fundamentals of how this would work out mathematically, right now I'm much, much more focussed on getting it done, so a conceptual understanding of the systems and processes involved is what I'm looking to get.

While a "conceptual understanding of the systems and processes involved" is a prerequisite for doing advanced analytics, it isn't sufficient if you're the one conducting the analysis (it would be sufficient for a manager, who's not as close to the modeling).

With just a general idea of what's going on, say, in a logistic regression model, you would likely throw all statistical assumptions (which are important) to the wind. Do you know whether certain features or groups shouldn't be included because there aren't enough observations in that group for the test statistic to be valid? What can happen to your predictions and hypotheses when you have high variance-inflation factors?

These are important considerations when doing statistics, and oftentimes people see how easy it is to write `from sklearn.svm import SVC` or something like that, and run wild. That's how you get caught with your pants around your ankles.

How do I build this box driven by ML?

You don't seem to have even a rudimentary understanding of how to approach machine/statistical learning problems. I would highly recommend that you take an "Introduction to Statistical Learning"- or "Intro to Regression Modeling"-type course so you can think about how to translate the URLs you have into meaningful features with real power to predict URL class. Think about how you can decompose a URL into individual pieces that might tell you which class that URL belongs to. If you're classifying espn.com domains by sport, it'd be pretty important to parse `nba` out of `http://www.espn.com/nba/team/roster/_/name/cle`, don't you think?
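As a rough illustration of that decomposition, here is a minimal sketch using only Python's standard library. The helper name `url_tokens` is hypothetical, and which pieces are actually worth keeping depends on your data.

```python
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL into host labels and path segments (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    host_labels = [label for label in host.split(".") if label]
    path_segments = [seg for seg in parts.path.split("/") if seg]
    return {"host": host_labels, "path": path_segments}


tokens = url_tokens("http://www.espn.com/nba/team/roster/_/name/cle")
print(tokens["host"])  # ['www', 'espn', 'com']
print(tokens["path"])  # ['nba', 'team', 'roster', '_', 'name', 'cle']
```

The first path segment (`nba`) is the piece that matters in the ESPN example; for other sites the informative piece might be a subdomain or a query parameter instead.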

Good luck with your project.

Edit:

To nudge you along, though: every ML problem boils down to some function mapping input to output. Your outputs are URL classes. Your inputs are URLs. However, machines only understand numbers, right? URLs aren't numbers (AFAIK). So you'll need to find a way to translate the information contained in the URLs into what we call "features" or "variables." One place to start would be [one-hot encoding](http://stackoverflow.com/questions/17469835/one-hot-encoding-for-machine-learning) different parts of each URL. Think about why I mentioned the ESPN example above, and why I extracted info like `nba` from the URL. I did that because, if I'm trying to predict which sport a given URL pertains to, `nba` is a dead giveaway (i.e. it would very likely be highly predictive of sport).
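A minimal sketch of that idea, in plain Python so the mechanics are visible: each distinct host or leading path segment seen in training becomes one 0/1 column. The helper names, the choice of "host plus first two path segments", and the second example URL are all assumptions for illustration; in practice a library encoder would normally do this for you.

```python
from urllib.parse import urlsplit


def url_features(url):
    """Return indicator feature names for one URL (hypothetical scheme)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    feats = {"host=" + (parts.hostname or "")}
    for i, seg in enumerate(segments[:2]):  # first two path segments only
        feats.add(f"path{i}={seg}")
    return feats


def fit_vocabulary(urls):
    """Map every feature name seen in the training URLs to a column index."""
    names = sorted(set().union(*(url_features(u) for u in urls)))
    return {name: idx for idx, name in enumerate(names)}


def one_hot(url, vocab):
    """Encode a URL as a 0/1 vector; features unseen in training are ignored."""
    vec = [0] * len(vocab)
    for name in url_features(url):
        if name in vocab:
            vec[vocab[name]] = 1
    return vec


train_urls = [
    "http://www.espn.com/nba/team/roster/_/name/cle",
    "http://www.espn.com/nfl/team/_/name/dal",  # made-up second example
]
vocab = fit_vocabulary(train_urls)
# Columns: host=www.espn.com, path0=nba, path0=nfl, path1=team
print(one_hot(train_urls[0], vocab))  # [1, 1, 0, 1]
print(one_hot(train_urls[1], vocab))  # [1, 0, 1, 1]
```

Vectors like these, paired with your category labels, are what you would feed to a classifier. With millions of URLs the vocabulary gets large, so a sparse representation is the usual choice.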

I think you got one major thing wrong about my question - the patterns of the URLs dictate the classification. There is no classification by 'type' involved. Anyway I'll think of some way to restructure my question better. Thanks for taking the time, some great tips. — user1265125, May 01 '17 at 19:56
I was looking for a conceptual understanding of how to get from where I am now to where I'm able to effectively implement such a system. The advice I'm trying to get from StackOverflow is: how do I get there? So saying "You don't seem to have even a rudimentary understanding of how to approach machine/statistical learning problems" is unnecessarily rude, don't you think? I'm literally asking exactly how to gain that rudimentary understanding. That's why I tried to stress that I'm approaching this problem as an outsider, because I knew it's a little hard to dodge these kinds of answers here on SO. — user1265125, May 01 '17 at 20:01
I'm not trying to be rude :) -- I just want to make sure *you* get a lot out of this whole process. My response is trying to point you in the right direction (i.e. go look *yourself* for resources on basic machine learning). You don't need to find an "URL-classification" example, necessarily; I think even a trivial project perhaps using the Titanic dataset would give you a good intro to **how to think about approaching these problems yourself**. What good would it do you if I provided a complete example that you would likely copy-paste for your data? You might not learn much that way. — blacksite, May 01 '17 at 20:07
