
I'm working on a project in C# that uses Principal Component Analysis to apply feature/dimension reduction on a double[,] matrix. The matrix columns are features (words and bigrams) that have been extracted from a set of emails. In the beginning we had around 156 emails, which resulted in approximately 23,000 terms, and everything worked as it was supposed to using the following code:

public static double[,] GetPCAComponents(double[,] sourceMatrix, int dimensions = 20, AnalysisMethod method = AnalysisMethod.Center) 
{
    // Create Principal Component Analysis of a given source
    PrincipalComponentAnalysis pca = new PrincipalComponentAnalysis(sourceMatrix, method);

    // Compute the Principal Component Analysis
    pca.Compute();

    // Creates a projection of the information
    double[,] pcaComponents = pca.Transform(sourceMatrix, dimensions);

    // Return PCA Components
    return pcaComponents;
}

The components we received were later classified using the Classify method of Linear Discriminant Analysis from the Accord.NET framework, and everything worked as it should.
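
That classification stage looked roughly like the following sketch (`labels` and `newSample` are placeholders for our per-email integer class labels and a PCA-projected test vector; this assumes the same older Compute/Classify API we were using):

using Accord.Statistics.Analysis;

// Reduce the term matrix, then train an LDA classifier on the components
double[,] components = GetPCAComponents(sourceMatrix);
var lda = new LinearDiscriminantAnalysis(components, labels);
lda.Compute();

// Classify a new, already PCA-projected sample
int predictedClass = lda.Classify(newSample);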

Now that we have increased the size of our dataset (1,519 emails and 68,375 terms), we at first got some OutOfMemoryExceptions. We managed to solve these by adjusting parts of our code until we reached the point where the PCA components are calculated. Right now this takes about 45 minutes, which is way too long. After checking the Accord.NET documentation on PCA, we decided to try the last example, which uses a covariance matrix, since it says: "Some users would like to analyze huge amounts of data. In this case, computing the SVD directly on the data could result in memory exceptions or excessive computing times". We therefore changed our code to the following:

public static double[,] GetPCAComponents(double[,] sourceMatrix, int dimensions = 20, AnalysisMethod method = AnalysisMethod.Center) 
{
    // Compute the mean vector of the source data
    double[] mean = Accord.Statistics.Tools.Mean(sourceMatrix);

    // Compute the covariance matrix
    double[,] covariance = Accord.Statistics.Tools.Covariance(sourceMatrix, mean);

    // Create the analysis from the covariance matrix
    var pca = PrincipalComponentAnalysis.FromCovarianceMatrix(mean, covariance);

    // Compute the Principal Component Analysis
    pca.Compute();

    // Create a projection of the information
    double[,] pcaComponents = pca.Transform(sourceMatrix, dimensions);

    // Return PCA Components
    return pcaComponents;
}

This, however, raises a System.OutOfMemoryException. Does anyone know how to solve this problem?

Redesign1991
  • Go x64 and mount more memory? – xanatos May 08 '15 at 11:31
  • How big is the input array source matrix? Is the exception also raised if you only provide one entry? – Romano Zumbé May 08 '15 at 11:36
  • I have a quad-core CPU Q9300 2.50GHz, 8GB RAM and a 64-bit operating system, so I'm not sure whether this is the problem or not. When using the 1st code sample the computation doesn't throw an error but takes +- 45 mins. Using the covariance matrix should be better suited but throws the error. – Redesign1991 May 08 '15 at 11:38
  • Have you profiled the app to know where the memory is going? PCA is an eigenvalue problem. You need to find an algorithm that works for large, full matrices. Better yet, you could benefit from a parallel algorithm to cut down on that 45-minute wait time. http://elpa-lib.fhi-berlin.mpg.de/wiki/index.php/Main_Page – duffymo May 08 '15 at 11:38
  • The size is sourceMatrix[529, 34482] (and this isn't even the full dataset; the full set is double this size). The mean is [34482]. – Redesign1991 May 08 '15 at 11:40
  • @Redesign1991 Is your app running at 64 bits? (`bool is64 = Environment.Is64BitProcess`) – xanatos May 08 '15 at 11:49

2 Answers


I think parallelizing your solver is the best bet.

Perhaps something like FEAST would help:

http://www.ecs.umass.edu/~polizzi/feast/

It offers parallel linear algebra for multicore systems.
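
FEAST itself is a native library, so using it from C# would take interop work; but even plain .NET task parallelism can help with the heavy steps. A minimal sketch (my own illustration, not FEAST) that parallelizes the covariance computation with Parallel.For:

using System.Threading.Tasks;

static double[,] ParallelCovariance(double[,] data, double[] mean)
{
    int n = data.GetLength(0), p = data.GetLength(1);
    var cov = new double[p, p];

    // Each (i, j) column pair is independent, so the rows of the
    // covariance matrix can be filled in parallel without locking.
    Parallel.For(0, p, i =>
    {
        for (int j = i; j < p; j++)
        {
            double sum = 0;
            for (int k = 0; k < n; k++)
                sum += (data[k, i] - mean[i]) * (data[k, j] - mean[j]);
            cov[i, j] = cov[j, i] = sum / (n - 1); // sample covariance
        }
    });

    return cov;
}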

duffymo

The problem is that the code is using multi-dimensional matrices rather than jagged matrices. The point is that a double[,] needs a contiguous block of memory to be allocated, which may be quite hard to find depending on how much space you need. If you use jagged matrices (double[][]), the allocations are spread out and space is easier to find.
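
To put rough numbers on it (assuming 8-byte doubles): the full 1,519 × 68,375 term matrix is about 1,519 × 68,375 × 8 bytes ≈ 0.8 GB in one contiguous allocation, and the 68,375 × 68,375 covariance matrix built by your second version would need around 68,375² × 8 bytes ≈ 37 GB, which simply cannot be allocated on an 8 GB machine.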

You can avoid this issue by upgrading to the latest version of the framework and using the new API for statistical analysis. Instead of passing your source matrix to the constructor and calling .Compute(), simply call .Learn():

public static double[][] GetPCAComponents(double[][] sourceMatrix, int dimensions = 20, AnalysisMethod method = AnalysisMethod.Center) 
{
    // Create Principal Component Analysis of a given source
    PrincipalComponentAnalysis pca = new PrincipalComponentAnalysis(method)
    {
        NumberOfOutputs = dimensions // limit the number of dimensions
    };

    // Learn the Principal Component Analysis from the data
    pca.Learn(sourceMatrix);

    // Creates a projection of the information
    double[][] pcaComponents = pca.Transform(sourceMatrix);

    // Return PCA Components
    return pcaComponents;
}
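
Since your current code works on double[,], you'll need to convert to jagged arrays first. Assuming a recent enough Accord.NET where the ToJagged() extension in Accord.Math is available, the conversion is a one-liner:

using Accord.Math; // for the ToJagged() extension

double[][] jagged = sourceMatrix.ToJagged(); // double[,] -> double[][]
double[][] reduced = GetPCAComponents(jagged);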
Cesar
  • I'm trying to implement the class you've provided. Currently receiving one error on: eigenvectors = Matrix.Sort(eigenvalues, eigenvectors, new GeneralComparer(ComparerDirection.Descending, true)); it says the type arguments cannot be inferred from the usage and to try specifying them explicitly. I have the Accord.Math namespace referenced. – Redesign1991 May 08 '15 at 14:26
  • Oops, I forgot to include this method. Here it is: https://gist.github.com/cesarsouza/aeb3c080c502ea5702b5 – Cesar May 09 '15 at 12:01