I am relatively new to C# and completely puzzled by an OOM error I am getting. I am trying to build a sparse matrix, and to that end I collect triplets of (row index, column index, value). While the for loop runs, the physical memory used by the process (what Windows calls the "working set" in the resource manager, I believe) stays roughly fixed at around 3.5 GB. However, the commit size (which I believe is the virtual memory charged to the process) keeps increasing until the commit limit is reached, at which point my program crashes with an OutOfMemoryException.
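For reference, both numbers can also be logged from inside the process; below is a minimal sketch using System.Diagnostics (the MemoryReport class and its placement are just an illustration, and as far as I understand PrivateMemorySize64 roughly corresponds to the commit number the resource manager shows):

using System;
using System.Diagnostics;

static class MemoryReport {
    // Prints physical memory (working set) vs. committed private bytes for this process.
    public static void Print(string label) {
        using (Process p = Process.GetCurrentProcess()) {
            Console.WriteLine("{0}: working set = {1:N0} MB, private/commit = {2:N0} MB",
                label, p.WorkingSet64 >> 20, p.PrivateMemorySize64 >> 20);
        }
    }
}

Calling MemoryReport.Print every few thousand iterations of the loop below makes the divergence between the two numbers easy to see.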
The relevant code is below:
public SimMatrix(string sparseMethod, string simMethod, Phrases phrases, DistrFeature features, int topK) {
    List<int> rows = new List<int>(phrases.uniquePhraseCount * topK);
    List<int> cols = new List<int>(phrases.uniquePhraseCount * topK);
    List<double> vals = new List<double>(phrases.uniquePhraseCount * topK);
    if (sparseMethod.Equals("invIdx")) {
        List<int> nonzeros = new List<int>(features.inverted_idx.Count());
        List<int> neighbors = new List<int>(phrases.uniquePhraseCount);
        List<double> simVals = new List<double>(phrases.uniquePhraseCount);
        List<int> sortedIdx = new List<int>(phrases.uniquePhraseCount);
        List<double> sortedSim = new List<double>(phrases.uniquePhraseCount);
        for (int i = 0; i < phrases.uniquePhraseCount; i++) { //loop through all phrases
            using (SparseDoubleArray row = phrases.feature_values.GetRowSparse(i))
            {
                if (phrases.feature_values.RowLength(i) > 0) { //i.e., at least one feature fired for phrase
                    nonzeros = (from pmi in row.Elements select pmi.IndexList[1]).ToList();
                    neighbors = generateNeighbors(nonzeros, features.inverted_idx);
                    foreach (int neighbor in neighbors)
                        simVals.Add(cosineSimilarity(row, phrases.feature_values.GetRowSparse(neighbor)));
                    var sortedIdxSim = neighbors.Zip(simVals, (a, b) => new { idx = a, sim = b }).OrderByDescending(pair => pair.sim);
                    sortedIdx = sortedIdxSim.Select(pair => pair.idx).ToList();
                    sortedSim = sortedIdxSim.Select(pair => pair.sim).ToList();
                    int topN = (sortedIdxSim.Count() < topK) ? sortedIdxSim.Count() : topK;
                    rows.AddRange(Enumerable.Repeat(i, topN).ToList());
                    cols.AddRange(sortedIdx.Take(topN).ToList());
                    vals.AddRange(sortedSim.Take(topN).ToList());
                    nonzeros.Clear();
                    neighbors.Clear();
                    simVals.Clear();
                    sortedIdx.Clear();
                    sortedSim.Clear();
                }
                else { //just add self similarity
                    rows.Add(i);
                    cols.Add(i);
                    vals.Add(1);
                }
                Console.WriteLine("{0} phrases done", i + 1);
            }
        }
    }
    else { Console.WriteLine("Sorry, no other sparsification method implemented thus far"); }
    simMat = new SparseDoubleArray(phrases.uniquePhraseCount, phrases.uniquePhraseCount, rows, cols, vals);
}

static private List<int> generateNeighbors(List<int> idx, Dictionary<int, List<int>> inverted_idx) {
    List<int> neighbors = new List<int>();
    foreach (int feature in idx) {
        neighbors.AddRange(inverted_idx[feature]);
        neighbors = neighbors.Distinct().ToList();
    }
    return neighbors;
}

static public double cosineSimilarity(SparseDoubleArray profile1, SparseDoubleArray profile2) {
    double numerator = profile1.Dot(profile2);
    double norm1 = profile1.Norm();
    double norm2 = profile2.Norm();
    double cos_sim = numerator / (norm1 * norm2);
    if (cos_sim > 0)
        return cos_sim;
    else
        return 0;
}
Note that the code makes use of some internal libraries (e.g., the SparseDoubleArray type). The basic gist is that I loop through all of my entries (indexed by i), and for each entry I find the non-zero column indices, from which I generate a list of candidate neighbors via the generateNeighbors function. Once I have the candidate neighbors, I compute the cosine similarity for each of them. I then sort the indices and similarity values together, select the topN highest-scoring index/similarity pairs, and add those, along with the index i (the row index), to the lists that hold the sparse matrix indices and values.
The crash happens seemingly non-deterministically partway through the for loop: sometimes at i = 25,000, sometimes at i = 2,000. I never even reach the point where the sparse matrix itself is initialized.
Any insights or help would be appreciated.
Update (June 10, 2013)
Thanks to the response provided, I have managed to drastically reduce the committed memory of my code. The updated code is below; you will notice that it is not exactly the same as the code in the response, and I will detail what I needed to change.
public SimMatrix(string sparseMethod, string simMethod, Phrases phrases, DistrFeature features, int topK) {
    List<int> rows = new List<int>(phrases.uniquePhraseCount * topK);
    List<int> cols = new List<int>(phrases.uniquePhraseCount * topK);
    List<double> vals = new List<double>(phrases.uniquePhraseCount * topK);
    if (sparseMethod.Equals("invIdx")) {
        for (int i = 0; i < phrases.uniquePhraseCount; i++) { //loop through all phrases
            using (SparseDoubleArray row = phrases.feature_values.GetRowSparse(i))
            {
                if (phrases.feature_values.RowLength(i) > 0) { //i.e., at least one feature fired for phrase
                    IEnumerable<int> nonzeros = from pmi in row.Elements select pmi.IndexList[1];
                    IEnumerable<int> neighbors = nonzeros.SelectMany(x => features.inverted_idx[x]).Distinct();
                    IEnumerable<double> simVals = neighbors.Select(x => cosineSimilarity(row, x, phrases));
                    var sortedIdxSim = neighbors.Zip(simVals, (a, b) => new { idx = a, sim = b }).OrderByDescending(pair => pair.sim).ToList();
                    //IEnumerable<int> sortedIdx = sortedIdxSim.Select(pair => pair.idx);
                    //IEnumerable<double> sortedSim = sortedIdxSim.Select(pair => pair.sim);
                    int sortedIdxSimCount = sortedIdxSim.Count;
                    int topN = (sortedIdxSimCount < topK) ? sortedIdxSimCount : topK;
                    rows.AddRange(Enumerable.Repeat(i, topN));
                    cols.AddRange(sortedIdxSim.Take(topN).Select(pair => pair.idx));
                    vals.AddRange(sortedIdxSim.Take(topN).Select(pair => pair.sim));
                }
                else { //just add self similarity
                    rows.Add(i);
                    cols.Add(i);
                    vals.Add(1);
                }
                if ((i % 1000) == 0)
                    Console.WriteLine("{0} phrases done;", i + 1);
            }
        }
    }
    else { Console.WriteLine("Sorry, no other sparsification method implemented thus far"); }
    simMat = new SparseDoubleArray(phrases.uniquePhraseCount, phrases.uniquePhraseCount, rows, cols, vals);
}

static public double cosineSimilarity(SparseDoubleArray profile1, int profile2idx, Phrases phrases) {
    using (SparseDoubleArray profile2 = phrases.feature_values.GetRowSparse(profile2idx)) {
        double numerator = profile1.Dot(profile2);
        double norm1 = profile1.Norm();
        double norm2 = profile2.Norm();
        double cos_sim = numerator / (norm1 * norm2);
        if (cos_sim > 0)
            return cos_sim;
        else
            return 0;
    }
}
Firstly, I was forced to convert the var sortedIdxSim from an IEnumerable to a List, because (a) I need to know the number of elements in the sequence, and (b) calling .Count() on the IEnumerable seemed to wipe out the data it held. It also seemed as if calling .Take() on the IEnumerable<int> sortedIdx (e.g., as per the original suggestion by Gjeltema) wiped out the data in the IEnumerable<double> sortedSim. Is this due to deferred execution? I'm not too familiar with lazy evaluation/deferred execution, so maybe I'm misunderstanding what I need to do here.
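From what I have pieced together since, nothing is actually "wiped out": a deferred query simply re-executes from scratch every time it is enumerated, so .Count() and each .Take() were re-running the whole pipeline rather than consuming stored data. A small standalone sketch (the evaluation counter is purely for illustration) that shows the re-execution:

using System;
using System.Collections.Generic;
using System.Linq;

class DeferredExecutionDemo {
    static void Main() {
        int evaluations = 0;
        // A deferred query: the lambda runs only when the sequence is enumerated.
        IEnumerable<int> squares = Enumerable.Range(1, 5).Select(x => { evaluations++; return x * x; });

        int count = squares.Count();         // first enumeration:  evaluations == 5
        int sum = squares.Take(2).Sum();     // second enumeration: evaluations == 7
        List<int> stored = squares.ToList(); // third enumeration:  evaluations == 12

        Console.WriteLine("count={0}, sum={1}, evaluations={2}", count, sum, evaluations);
        // The query re-ran on every enumeration; nothing was "wiped out".
    }
}

If that understanding is right, then in my code each extra enumeration of sortedIdx/sortedSim would have re-run the Distinct, the sort, and every cosineSimilarity call (each of which opens another row), which would explain why materializing sortedIdxSim with .ToList(), so the pipeline runs exactly once, fixed both symptoms.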
In any case, the changes have reduced my committed memory considerably, to the point where the program actually runs to completion, so many thanks for that! If someone can help clarify the deferred-execution issue above, that would be great.