How to use arrays in machine learning classes?

Question

I'm new to C++ and I think a good way for me to jump in is to build some basic models that I've built in other languages. I want to start with just Linear Regression solved using first-order methods. So here's how I want things to be organized (in pseudocode).

class LinearRegression

    LinearRegression:
        tol = <a supplied tolerance or defaulted to 1e-5>
        max_iter = <a supplied max iteration count or defaulted to 1000>

    fit(X, y):
        // model learns weights specific to this data set

    _gradient(X, y):
        // compute the gradient

    score(X,y):
        // model uses weights learned from fit to compute accuracy of 
        //   y_predicted to actual y

My question is about how to pass the data. The fit, score and gradient methods don't need to copy the arrays (X and y) or store them anywhere, so I want to use a reference or a pointer to those structures. My problem is that if a method accepts a pointer to a 2D array, I need to supply the size of the second dimension ahead of time or use templating. If I use templating, I end up with something like this for every method that accepts a 2D array:

template<std::size_t rows, std::size_t cols>
void fit(double (&X)[rows][cols], double (&y)[rows]){...}

It seems there is likely a better way. I want my regression class to work with input of any size. How is this done in industry? I know that in some situations the array is just flattened into row- or column-major format and only a pointer to the first element is passed, but I don't have enough experience to know what people use in C++.

Usually we use `std::vector`. – molbdnilo Oct 08 '16 at 18:05

score 2 · Accepted Answer · edited May 23 '17 at 12:00

You raised quite a few points in your question, so here are some points addressing them:

1. Contemporary C++ discourages working directly with heap-allocated data that you have to allocate and deallocate manually. You can use, e.g., std::vector<double> to represent vectors and std::vector<std::vector<double>> to represent matrices. Even better would be to use a matrix class, preferably one that is already in mainstream use.
2. Once you use such a class, you can easily get the dimensions at runtime. With std::vector, for example, you can use the size() method. Other classes have other methods; check the documentation for the one you choose.
3. You probably really don't want to use templates for the dimensions.

a. If you do, the dimensions must be known at compile time, so you will need to recompile each time you get an input of a different size. The compiler will also duplicate your code once for each distinct combination of dimensions you use. Lots of bad stuff, with little gain (in this case). There's no real drawback to getting the dimensions at runtime from the class.

b. Templates (in your setting) are a good fit for the element type of the matrix (e.g., is it a matrix of doubles or floats), or possibly the number of dimensions (e.g., for specifying tensors).

4. Your regressor doesn't need to store the matrix and/or vector. Pass them by const reference. Your interface looks like that of sklearn. If you like, check the source code there. Calling fit just causes the class object to store the parameter vector β used for prediction. It doesn't copy or store the input matrix and/or vector.
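To make the points above concrete, here is a minimal sketch of how such a class could look with std::vector and const references. It is only an illustration, not the way any particular library does it. The aliases Vector and Matrix, the learning_rate parameter, and the predict and weights members are names assumed for this example. It also assumes that every row of X has the same length, that y has one entry per row, that there is no intercept term (add a column of ones to X if you want one), and that score returns the mean squared error rather than sklearn's R².

```cpp
#include <cstddef>
#include <vector>

// Hypothetical aliases for this sketch; a real matrix class would replace them.
using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;  // one inner vector per sample (row)

class LinearRegression {
public:
    // learning_rate is an assumed extra parameter needed by gradient descent.
    explicit LinearRegression(double tol = 1e-5, std::size_t max_iter = 1000,
                              double learning_rate = 0.01)
        : tol_(tol), max_iter_(max_iter), learning_rate_(learning_rate) {}

    // Learns the weights by batch gradient descent on the mean squared error.
    // X and y are only read through const references; nothing is copied or stored.
    void fit(const Matrix& X, const Vector& y) {
        // Dimensions are queried at runtime, so any input size works.
        const std::size_t cols = X.empty() ? 0 : X.front().size();
        weights_.assign(cols, 0.0);
        for (std::size_t it = 0; it < max_iter_; ++it) {
            const Vector g = gradient(X, y);
            double norm_sq = 0.0;
            for (std::size_t j = 0; j < cols; ++j) {
                weights_[j] -= learning_rate_ * g[j];
                norm_sq += g[j] * g[j];
            }
            // Stop once the gradient norm drops below the tolerance.
            if (norm_sq < tol_ * tol_) break;
        }
    }

    // Dot product of the learned weights with one sample.
    double predict(const Vector& x) const {
        double sum = 0.0;
        for (std::size_t j = 0; j < weights_.size(); ++j) sum += weights_[j] * x[j];
        return sum;
    }

    // Mean squared error of the predictions (lower is better).
    double score(const Matrix& X, const Vector& y) const {
        if (X.empty()) return 0.0;
        double sum = 0.0;
        for (std::size_t i = 0; i < X.size(); ++i) {
            const double residual = predict(X[i]) - y[i];
            sum += residual * residual;
        }
        return sum / static_cast<double>(X.size());
    }

    const Vector& weights() const { return weights_; }

private:
    // Gradient of the mean squared error with respect to the weights.
    Vector gradient(const Matrix& X, const Vector& y) const {
        Vector g(weights_.size(), 0.0);
        const double n = static_cast<double>(X.size());
        for (std::size_t i = 0; i < X.size(); ++i) {
            const double residual = predict(X[i]) - y[i];
            for (std::size_t j = 0; j < g.size(); ++j) {
                g[j] += 2.0 * residual * X[i][j] / n;
            }
        }
        return g;
    }

    double tol_;
    std::size_t max_iter_;
    double learning_rate_;
    Vector weights_;  // the only state kept after fit()
};
```

With this shape, a caller builds a Matrix and a Vector of any size, calls fit(X, y), and then score(X, y). The only thing the object keeps afterwards is the weight vector, which matches point 4 above. A nested std::vector is the simplest thing to write, but a dedicated matrix class with contiguous storage is likely a better choice for anything performance-sensitive.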
