What classification algorithm should I use for document classification with these variables?

Question

I'm trying to classify pages in documents, in particular to find one specific page, based on a bag of words, the page layout, whether the page contains tables, whether it has bold titles, etc. With this premise I have created a pandas.DataFrame like this for each document:

    page  totalCharCount  matchesOfWordX  matchesOfWordY          hasFeaturesX     hasFeaturesY   hasTable      score
0    0.0           608.0             0.0             2.0                   0.0              0.0        0.0        0.0
1    1.0          3292.0             1.0            24.0                   7.0              0.0        0.0        0.0
2    2.0          3302.0             0.0            15.0                   1.0              0.0        1.0        0.0
3    3.0            26.0             0.0             0.0                   0.0              1.0        1.0        1.0
4    4.0          1851.0             3.0            25.0                  20.0              7.0        0.0        0.0
5    5.0          2159.0             0.0            27.0                   6.0              0.0        0.0        0.0
6    6.0          1906.0             0.0             9.0                  15.0              3.0        0.0        0.0
7    7.0          1825.0             0.0            24.0                   9.0              0.0        0.0        0.0
8    8.0          2053.0             0.0            20.0                  10.0              2.0        0.0        0.0
9    9.0          2082.0             2.0            16.0                   3.0              2.0        0.0        0.0
10  10.0          2206.0             0.0            30.0                   1.0              0.0        0.0        0.0
11  11.0          1746.0             3.0            31.0                   3.0              0.0        0.0        0.0
12  12.0          1759.0             0.0            38.0                   3.0              1.0        0.0        0.0
13  13.0          1790.0             0.0            21.0                   0.0              0.0        0.0        0.0
14  14.0          1759.0             0.0            11.0                   6.0              0.0        0.0        0.0
15  15.0          1539.0             0.0            20.0                   3.0              0.0        0.0        0.0
16  16.0          1891.0             0.0            13.0                   6.0              1.0        0.0        0.0
17  17.0          1101.0             0.0             4.0                   0.0              1.0        0.0        0.0
18  18.0          2247.0             0.0            16.0                   5.0              5.0        0.0        0.0
19  19.0           598.0             2.0             3.0                   1.0              1.0        0.0        0.0
20  20.0          1014.0             2.0             1.0                  16.0              3.0        0.0        0.0
21  21.0           337.0             1.0             2.0                   1.0              1.0        0.0        0.0
22  22.0           258.0             0.0             0.0                   0.0              0.0        0.0        0.0

I'm looking at the Naive Bayes and SVM algorithms, but I'm not sure which one fits the problem better. The variables are independent. Some of them must be present to increase the score, and some of them, like totalCharCount, work like an inverse document frequency.

Any help?

Thanks a lot!

Florian H · Answer 1 · 2017-10-19T12:20:48.150

Because of the continuous score, which I assume is your label, it's a regression problem. SVMs are more common for classification problems. There are lots of possible algorithms out there. Logistic regression, which outputs a probability between 0 and 1, would be a pretty common choice for something like this.

Edit

Now that you have edited your post, your problem has become a classification problem :-)

Classification = a set of classes you want to assign your data to, such as boolean (True, False) or multinomial (Big, Middle, Small, Very Small).

Regression = continuous values (e.g. all real numbers between 0 and 1).
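
A quick heuristic for telling the two cases apart is to look at the distinct label values. The snippet below is only a sketch: the function name, the `max_classes` threshold and the sample lists are assumptions made for illustration, not part of the question.

```python
# Hypothetical label values; the first list mirrors the 0/1 `score` column.
scores = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]


def looks_like_classification(labels, max_classes=10):
    """Heuristic: a few distinct whole-number labels suggest classes,
    anything else suggests a continuous regression target."""
    distinct = set(labels)
    return len(distinct) <= max_classes and all(
        float(v).is_integer() for v in distinct
    )


print(looks_like_classification(scores))             # True
print(looks_like_classification([0.12, 0.57, 0.9]))  # False
```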

Now you can try your SVM and see if it works well enough for your data.

See @Maxim's answer; he makes some good points (balancing, scaling).

The column score is what I'm trying to predict. Its values can be 0 or 1: the page either matches the one I'm trying to find or it doesn't. I have edited the post. I will take a look at the logistic regression algorithm. Thanks for the response, Florian! — rePack, Oct 19 '17 at 11:15

score 0 · Answer 2 · answered Oct 19 '17 at 11:40

Generally, it's hard to say which method will work best: I assume you have a lot more data, and the answer depends heavily on that data. Still, here are some ideas:

Although you say the features are independent, totalCharCount and matchesOfWordY appear to be dependent. It's reasonable to assume that the more characters a page has, the more matches there are likely to be. That's a strong sign against Naive Bayes, which assumes the features are conditionally independent given the class.
Binary logistic regression looks much better and would be my first candidate. One suggestion, though, is to normalize the totalCharCount feature, because its scale is much larger than that of the other features.
Unless you have many more training examples of class 1, your data is unbalanced. If so, you are likely to run into the constant-prediction problem, where the model always predicts the majority class. A possible solution is to use a weighted cross-entropy loss function.
In addition to an SVM classifier, also consider xgboost.XGBClassifier. Both can give very good accuracy.
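
The first three suggestions can be tried with the standard library alone. The sketch below is an illustration under stated assumptions, not the answerer's code. It takes three columns copied from the question's table, measures the correlation between totalCharCount and matchesOfWordY, standardizes the features, and fits a logistic regression by plain gradient descent on a class-weighted cross-entropy loss. The helper names, the learning rate and the step count are made up for the example, and the "balanced" weighting (n_samples / (n_classes * class_count)) is one common choice among several.

```python
import math
from statistics import mean, pstdev

# (totalCharCount, matchesOfWordY, hasTable, score), copied from the question's table.
ROWS = [
    (608, 2, 0, 0), (3292, 24, 0, 0), (3302, 15, 1, 0), (26, 0, 1, 1),
    (1851, 25, 0, 0), (2159, 27, 0, 0), (1906, 9, 0, 0), (1825, 24, 0, 0),
    (2053, 20, 0, 0), (2082, 16, 0, 0), (2206, 30, 0, 0), (1746, 31, 0, 0),
    (1759, 38, 0, 0), (1790, 21, 0, 0), (1759, 11, 0, 0), (1539, 20, 0, 0),
    (1891, 13, 0, 0), (1101, 4, 0, 0), (2247, 16, 0, 0), (598, 3, 0, 0),
    (1014, 1, 0, 0), (337, 2, 0, 0), (258, 0, 0, 0),
]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))


def standardize(col):
    """Rescale a column to zero mean and unit variance."""
    mu, sigma = mean(col), pstdev(col)
    return [(v - mu) / sigma for v in col]


def predict(w, b, x):
    """Probability of class 1 under the logistic model."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def weighted_loss(X, y, w, b, sample_w):
    """Cross-entropy in which each sample counts with its class weight."""
    total = 0.0
    for xi, yi, swi in zip(X, y, sample_w):
        p = min(max(predict(w, b, xi), 1e-12), 1 - 1e-12)  # avoid log(0)
        total -= swi * (yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return total / sum(sample_w)


def fit(X, y, sample_w, lr=0.1, steps=2000):
    """Batch gradient descent on the weighted cross-entropy."""
    w, b = [0.0] * len(X[0]), 0.0
    norm = sum(sample_w)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi, swi in zip(X, y, sample_w):
            err = swi * (predict(w, b, xi) - yi)
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * gj / norm for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / norm
    return w, b


chars = [r[0] for r in ROWS]
word_y = [r[1] for r in ROWS]
table = [r[2] for r in ROWS]
y = [r[3] for r in ROWS]

# 1. Dependence check between two supposedly independent features.
r = pearson(chars, word_y)

# 2. Put all features on a comparable scale.
X = [list(x) for x in zip(standardize(chars), standardize(word_y), standardize(table))]

# 3. "Balanced" class weights so the rare class 1 is not ignored.
counts = {c: y.count(c) for c in set(y)}
class_w = {c: len(y) / (len(counts) * n) for c, n in counts.items()}
sample_w = [class_w[yi] for yi in y]

initial_loss = weighted_loss(X, y, [0.0] * 3, 0.0, sample_w)
w, b = fit(X, y, sample_w)
final_loss = weighted_loss(X, y, w, b, sample_w)
probs = [predict(w, b, xi) for xi in X]

print(f"corr(totalCharCount, matchesOfWordY) = {r:.2f}")
print(f"weighted loss: {initial_loss:.3f} -> {final_loss:.3f}")
print(f"P(score = 1) for page 3: {probs[3]:.3f}")
```

With a single positive page in the table, this only demonstrates the mechanics; it says nothing about how well the model generalizes. On real data, a library implementation would normally replace the hand-written loop, for example scikit-learn's LogisticRegression with class_weight="balanced" behind a StandardScaler, evaluated on held-out documents.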
