I'm working on a project in C# that uses Principal Component Analysis to apply feature reduction/dimension reduction to a double[,] matrix. The matrix columns are features (words and bigrams) that have been extracted from a set of emails. In the beginning we had around 156 emails, which resulted in approximately 23000 terms, and everything worked as it was supposed to using the following code:
    using Accord.Statistics.Analysis; // PrincipalComponentAnalysis, AnalysisMethod

    public static double[,] GetPCAComponents(double[,] sourceMatrix, int dimensions = 20, AnalysisMethod method = AnalysisMethod.Center)
    {
        // Create Principal Component Analysis of a given source
        PrincipalComponentAnalysis pca = new PrincipalComponentAnalysis(sourceMatrix, method);

        // Compute the Principal Component Analysis
        pca.Compute();

        // Project the source data onto the first 'dimensions' components
        double[,] pcaComponents = pca.Transform(sourceMatrix, dimensions);

        // Return PCA Components
        return pcaComponents;
    }
The resulting components were later classified using the Classify method of Linear Discriminant Analysis from the Accord.NET framework. This also worked as it should.
Now that we have increased the size of our dataset (1519 emails and 68375 terms), we initially got some OutOfMemoryExceptions. We were able to solve these by adjusting parts of our code, and we can now reach the step where we calculate the PCA components. That step currently takes about 45 minutes, which is way too long. After checking the Accord.NET website on PCA, we decided to try the last example there, which uses a covariance matrix, since it says: "Some users would like to analyze huge amounts of data. In this case, computing the SVD directly on the data could result in memory exceptions or excessive computing times". We therefore changed our code to the following:
    // Note: the 'method' parameter is not used in this version
    public static double[,] GetPCAComponents(double[,] sourceMatrix, int dimensions = 20, AnalysisMethod method = AnalysisMethod.Center)
    {
        // Compute mean vector
        double[] mean = Accord.Statistics.Tools.Mean(sourceMatrix);

        // Compute Covariance matrix
        double[,] covariance = Accord.Statistics.Tools.Covariance(sourceMatrix, mean);

        // Create analysis using the covariance matrix
        var pca = PrincipalComponentAnalysis.FromCovarianceMatrix(mean, covariance);

        // Compute the Principal Component Analysis
        pca.Compute();

        // Project the source data onto the first 'dimensions' components
        double[,] pcaComponents = pca.Transform(sourceMatrix, dimensions);

        // Return PCA Components
        return pcaComponents;
    }
This, however, raises a System.OutOfMemoryException. Does anyone know how to solve this problem?
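For scale, here is a rough back-of-envelope estimate of the memory involved. It rests on two assumptions that I have not verified against the library: that the covariance is computed over the 68375 term columns (giving a 68375 x 68375 matrix), and that it is stored densely as 8-byte doubles, as a double[,] would be.

```latex
\begin{align*}
\text{covariance entries} &= 68375^2 = 4\,675\,140\,625 \\
\text{covariance size} &\approx 4\,675\,140\,625 \times 8\ \text{bytes} \approx 37.4\ \text{GB} \\
\text{source matrix size} &= 1519 \times 68375 \times 8\ \text{bytes} \approx 0.83\ \text{GB}
\end{align*}
```

If those assumptions hold, the covariance matrix alone would be roughly 45 times larger than the source matrix, which may be why this approach runs out of memory with our data.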