I'm trying to classify the pages of a document, specifically to find one particular page, based on features such as bag-of-words matches, page layout, whether the page contains tables, whether it has bold titles, etc. With this in mind, I have created a pandas.DataFrame like this for each document:
        page  totalCharCount  matchesOfWordX  matchesOfWordY  hasFeaturesX  hasFeaturesY  hasTable  score
    0    0.0           608.0             0.0             2.0           0.0           0.0       0.0    0.0
    1    1.0          3292.0             1.0            24.0           7.0           0.0       0.0    0.0
    2    2.0          3302.0             0.0            15.0           1.0           0.0       1.0    0.0
    3    3.0            26.0             0.0             0.0           0.0           1.0       1.0    1.0
    4    4.0          1851.0             3.0            25.0          20.0           7.0       0.0    0.0
    5    5.0          2159.0             0.0            27.0           6.0           0.0       0.0    0.0
    6    6.0          1906.0             0.0             9.0          15.0           3.0       0.0    0.0
    7    7.0          1825.0             0.0            24.0           9.0           0.0       0.0    0.0
    8    8.0          2053.0             0.0            20.0          10.0           2.0       0.0    0.0
    9    9.0          2082.0             2.0            16.0           3.0           2.0       0.0    0.0
    10  10.0          2206.0             0.0            30.0           1.0           0.0       0.0    0.0
    11  11.0          1746.0             3.0            31.0           3.0           0.0       0.0    0.0
    12  12.0          1759.0             0.0            38.0           3.0           1.0       0.0    0.0
    13  13.0          1790.0             0.0            21.0           0.0           0.0       0.0    0.0
    14  14.0          1759.0             0.0            11.0           6.0           0.0       0.0    0.0
    15  15.0          1539.0             0.0            20.0           3.0           0.0       0.0    0.0
    16  16.0          1891.0             0.0            13.0           6.0           1.0       0.0    0.0
    17  17.0          1101.0             0.0             4.0           0.0           1.0       0.0    0.0
    18  18.0          2247.0             0.0            16.0           5.0           5.0       0.0    0.0
    19  19.0           598.0             2.0             3.0           1.0           1.0       0.0    0.0
    20  20.0          1014.0             2.0             1.0          16.0           3.0       0.0    0.0
    21  21.0           337.0             1.0             2.0           1.0           1.0       0.0    0.0
    22  22.0           258.0             0.0             0.0           0.0           0.0       0.0    0.0
I'm looking at the Naive Bayes and SVM algorithms, but I'm not sure which one fits the problem better. The variables are independent. Some of them must be present to increase the score, and some of them behave like an inverse document frequency, such as totalCharCount.
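A minimal sketch of how both candidates could be tried on the frame above, assuming scikit-learn is available. The choice of `GaussianNB`, a scaled linear `SVC` with `C=100.0`, and the helper columns `nb_score` and `svm_score` are illustrative assumptions, not part of the original setup. Since this single document contains only one positive page, the sketch fits and ranks on the same rows, so it shows the mechanics rather than a fair comparison.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

columns = ["page", "totalCharCount", "matchesOfWordX", "matchesOfWordY",
           "hasFeaturesX", "hasFeaturesY", "hasTable", "score"]

# Rows copied from the per-document frame shown in the question.
rows = [
    (0, 608, 0, 2, 0, 0, 0, 0),    (1, 3292, 1, 24, 7, 0, 0, 0),
    (2, 3302, 0, 15, 1, 0, 1, 0),  (3, 26, 0, 0, 0, 1, 1, 1),
    (4, 1851, 3, 25, 20, 7, 0, 0), (5, 2159, 0, 27, 6, 0, 0, 0),
    (6, 1906, 0, 9, 15, 3, 0, 0),  (7, 1825, 0, 24, 9, 0, 0, 0),
    (8, 2053, 0, 20, 10, 2, 0, 0), (9, 2082, 2, 16, 3, 2, 0, 0),
    (10, 2206, 0, 30, 1, 0, 0, 0), (11, 1746, 3, 31, 3, 0, 0, 0),
    (12, 1759, 0, 38, 3, 1, 0, 0), (13, 1790, 0, 21, 0, 0, 0, 0),
    (14, 1759, 0, 11, 6, 0, 0, 0), (15, 1539, 0, 20, 3, 0, 0, 0),
    (16, 1891, 0, 13, 6, 1, 0, 0), (17, 1101, 0, 4, 0, 1, 0, 0),
    (18, 2247, 0, 16, 5, 5, 0, 0), (19, 598, 2, 3, 1, 1, 0, 0),
    (20, 1014, 2, 1, 16, 3, 0, 0), (21, 337, 1, 2, 1, 1, 0, 0),
    (22, 258, 0, 0, 0, 0, 0, 0),
]
df = pd.DataFrame(rows, columns=columns, dtype=float)

# Everything except the page number and the label is used as a feature.
features = [c for c in columns if c not in ("page", "score")]
X = df[features]
y = df["score"].astype(int)

# Candidate 1: Gaussian Naive Bayes, which treats the features as independent.
nb = GaussianNB().fit(X, y)

# Candidate 2: linear SVM on standardized features. A large C (assumed value)
# keeps the single positive row from being treated as an acceptable error.
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=100.0)).fit(X, y)

# Rank pages by each model's score for the positive class.
df["nb_score"] = nb.predict_proba(X)[:, 1]
df["svm_score"] = svm.decision_function(X)

nb_best = int(df.loc[df["nb_score"].idxmax(), "page"])
svm_best = int(df.loc[df["svm_score"].idxmax(), "page"])
print("Naive Bayes picks page", nb_best)
print("SVM picks page", svm_best)
```

For a meaningful comparison, the frames of many labeled documents would need to be concatenated and the models evaluated on documents held out from training, for example by grouping cross-validation folds by document.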
Any help?
Thanks a lot!