4

I'm attempting to classify some inputs (text classification: 10,000+ examples and 100,000+ features).

I've read that LibLinear is far faster and more memory-efficient for such tasks, so I've ported my LibSVM classifier to Accord.NET, like so:

        //SVM Settings
        var teacher = new MulticlassSupportVectorLearning<Linear, Sparse<double>>()
        {
            //Using LIBLINEAR's L2-loss SVC dual for each SVM
            Learner = (p) => new LinearDualCoordinateDescent<Linear, Sparse<double>>()
            {
                Loss = Loss.L2,
                Complexity = 1,
            }
        };

        var inputs = allTerms.Select(t => new Sparse<double>(t.Sentence.Select(s => s.Index).ToArray(), t.Sentence.Select(s => (double)s.Value).ToArray())).ToArray();

        var classes = allTerms.Select(t => t.Class).ToArray();

        //Train the model
        var model = teacher.Learn(inputs, classes);

At the point of .Learn() I get an instant OutOfMemoryException.

I've seen there's a CacheSize setting in the documentation; however, I cannot find where to lower this setting, as shown in many examples.

One possible reason: I'm using the 'hash trick' instead of plain indices. Is Accord.NET attempting to allocate an array covering the full hash space (probably close to int.MaxValue)? If so, is there any way to avoid this? A sketch of what I mean is below.
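
For illustration, here is a minimal sketch of the kind of hashed indexing I mean (the helper methods are hypothetical, not my actual pipeline): an unbounded hash can yield indices anywhere up to int.MaxValue, whereas folding it into a fixed number of buckets keeps the feature space small.

        // Hypothetical sketch of the 'hash trick': map a token to a feature index.
        // Unbounded: the index can fall anywhere in [0, int.MaxValue), so a dense
        // array over that range would exhaust memory.
        static int GetFeatureIndex(string token)
        {
            return token.GetHashCode() & 0x7FFFFFFF; // non-negative, up to ~2^31
        }

        // Bounded alternative: fold the hash into a fixed bucket count (e.g. 2^20)
        // so the maximum feature index stays small.
        static int GetBoundedFeatureIndex(string token, int buckets = 1 << 20)
        {
            return (token.GetHashCode() & 0x7FFFFFFF) % buckets;
        }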

Any help is most appreciated!

Dave Bish

1 Answer

1

Allocating hash space for 10,000+ documents with 100,000+ features will take at least 4 GB of memory (roughly 10,000 × 100,000 × 4 bytes ≈ 4 GB even for single-precision values), which can run into the AppDomain memory limit and the CLR object size limit. Many projects are built with a 32-bit platform preference by default, which does not allow allocating objects larger than 2 GB. I managed to overcome this by removing the 32-bit platform preference (go to project properties -> Build and uncheck "Prefer 32-bit"). After that, we should also allow the creation of objects taking more than 2 GB of memory; add this line to your configuration file:

    <runtime>
        <gcAllowVeryLargeObjects enabled="true" />
    </runtime>
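
For context, a sketch of where this element typically sits in a full App.config (the file name and surrounding content are assumptions, not part of the original snippet):

    <?xml version="1.0" encoding="utf-8"?>
    <configuration>
      <runtime>
        <!-- Allow single objects (e.g. large arrays) bigger than 2 GB on 64-bit -->
        <gcAllowVeryLargeObjects enabled="true" />
      </runtime>
    </configuration>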

Be aware that if you add this line but leave the 32-bit platform build preference enabled, you will still get the exception, as your project will not be able to allocate an array of that size.
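
If you prefer editing the project file directly, the same setting corresponds to the Prefer32Bit MSBuild property; a sketch follows (the exact PropertyGroup and any build-configuration conditions depend on your project, so treat this as illustrative):

    <!-- Illustrative only: target AnyCPU without the 32-bit preference -->
    <PropertyGroup>
      <PlatformTarget>AnyCPU</PlatformTarget>
      <Prefer32Bit>false</Prefer32Bit>
    </PropertyGroup>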

This is how you tune the CacheSize:

    //SVM Settings
    var teacher = new MulticlassSupportVectorLearning<Linear, Sparse<double>>()
    {
        Learner = (p) => new SequentialMinimalOptimization<Linear, Sparse<double>>()
        {
            CacheSize = 1000,
            Complexity = 1,
        }
    };

    var inputs = allTerms.Select(t => new Sparse<double>(t.Sentence.Select(s => s.Index).ToArray(), t.Sentence.Select(s => (double)s.Value).ToArray())).ToArray();

    var classes = allTerms.Select(t => t.Class).ToArray();

    //Train the model
    var model = teacher.Learn(inputs, classes);

This way of constructing an SVM does cope with the Sparse<double> data structure, but it is not using LibLinear. If you open the Accord.NET repository and look at the SVM solving algorithms with LibLinear support (LinearCoordinateDescent, LinearNewtonMethod), you will see no CacheSize property.

papadoble151
  • Thanks for your answer - one question - does this use 'Liblinear' behind the scenes? or Libsvm? I was under the impression that using 'LinearDualCoordinateDescent' meant Liblinear was used behind the scenes (and is supposedly much faster) – Dave Bish Jun 15 '17 at 16:49
  • Also - I already stem very heavily - So I don't think a non-sparse implementation will work in my scenario – Dave Bish Jun 15 '17 at 16:50
  • @DaveBish see the updated answer. It uses LibSVM just as the previous implementation – papadoble151 Jun 16 '17 at 11:42
  • Methods that include "Linear" in their name use implementations from liblinear under the hood. Methods that do not include "Linear" in their name either use LibSVM implementations or have been implemented from scratch following a few research papers (i.e. SequentialMinimalOptimization). – Cesar Aug 10 '17 at 22:34
  • If allocating a huge array of documents for training is an issue for you (or anyone else reading this comment) please open a new issue in the project's issue tracker with a sample of the dataset you are trying to learn. It shouldn't be too difficult to extend the project to read samples directly from the disk if that would help solve this issue. – Cesar Aug 10 '17 at 22:35