
I am relatively new to C#, and am completely puzzled by an OOM error that I get. I am trying to create a sparse matrix, and am therefore collecting triplets of (row index, column index, value). While going through the for loop, the physical memory actually used by the process (the "Working Set", as Windows Resource Monitor calls it, I believe) stays relatively flat at around 3.5 GB. However, the commit size (the virtual memory the process has reserved, I believe) keeps increasing until the commit limit is reached, at which point my program crashes with an OOM error.

The relevant code is below:

public SimMatrix(string sparseMethod, string simMethod, Phrases phrases, DistrFeature features, int topK) {
        List<int> rows = new List<int>(phrases.uniquePhraseCount*topK);
        List<int> cols = new List<int>(phrases.uniquePhraseCount*topK);
        List<double> vals = new List<double>(phrases.uniquePhraseCount*topK);
        if (sparseMethod.Equals("invIdx")) {
            List<int> nonzeros = new List<int>(features.inverted_idx.Count());
            List<int> neighbors = new List<int>(phrases.uniquePhraseCount);
            List<double> simVals = new List<double>(phrases.uniquePhraseCount);
            List<int> sortedIdx = new List<int>(phrases.uniquePhraseCount);
            List<double> sortedSim = new List<double>(phrases.uniquePhraseCount);                
            for (int i = 0; i < phrases.uniquePhraseCount; i++) { //loop through all phrases
                using (SparseDoubleArray row = phrases.feature_values.GetRowSparse(i)) 
                {
                    if (phrases.feature_values.RowLength(i) > 0) { //i.e., at least one feature fired for phrase                                                    
                        nonzeros = (from pmi in row.Elements select pmi.IndexList[1]).ToList();                            
                        neighbors = generateNeighbors(nonzeros, features.inverted_idx);
                        foreach (int neighbor in neighbors)
                            simVals.Add(cosineSimilarity(row, phrases.feature_values.GetRowSparse(neighbor)));
                        var sortedIdxSim = neighbors.Zip(simVals, (a, b) => new { idx = a, sim = b }).OrderByDescending(pair => pair.sim);                            
                        sortedIdx = sortedIdxSim.Select(pair => pair.idx).ToList();                            
                        sortedSim = sortedIdxSim.Select(pair => pair.sim).ToList();                            
                        int topN = (sortedIdxSim.Count() < topK) ? sortedIdxSim.Count() : topK;
                        rows.AddRange(Enumerable.Repeat(i, topN).ToList());
                        cols.AddRange(sortedIdx.Take(topN).ToList());
                        vals.AddRange(sortedSim.Take(topN).ToList());
                        nonzeros.Clear();
                        neighbors.Clear();
                        simVals.Clear();
                        sortedIdx.Clear();
                        sortedSim.Clear();
                    }
                    else { //just add self similarity
                        rows.Add(i);
                        cols.Add(i);
                        vals.Add(1);
                    }
                    Console.WriteLine("{0} phrases done", i + 1);                         
                }
            }
        }
        else { Console.WriteLine("Sorry, no other sparsification method implemented thus far"); }
        simMat = new SparseDoubleArray(phrases.uniquePhraseCount, phrases.uniquePhraseCount, rows, cols, vals); 
    }

    static private List<int> generateNeighbors(List<int> idx, Dictionary<int, List<int>> inverted_idx) {
        List<int> neighbors = new List<int>();
        foreach (int feature in idx) {
            neighbors.AddRange(inverted_idx[feature]);
            neighbors = neighbors.Distinct().ToList();
        }            
        return neighbors;            
    }

    static public double cosineSimilarity(SparseDoubleArray profile1, SparseDoubleArray profile2) {
        double numerator = profile1.Dot(profile2);
        double norm1 = profile1.Norm();
        double norm2 = profile2.Norm();
        double cos_sim = numerator / (norm1 * norm2);
        if (cos_sim > 0)
            return cos_sim;
        else
            return 0;            
    }

Note that the code makes use of some internal libraries (e.g., SparseDoubleArray object). The basic gist is that I loop through all of my entries (indexed by i), and for each entry I figure out the non-zero column indices, from which I generate a list of potential neighbors through the "generateNeighbors" function. Once I have a list of candidate neighbors, I compute the cosine similarity for each of the potential neighbors. I then sort the indices and similarity values simultaneously, select the topN indices/similarity values, and add those, along with the index i (corresponding to the row index), to my lists that maintain the sparse matrix indices and values.

The code breaks seemingly non-deterministically while going through the for loop. Sometimes it breaks at i = 25,000, sometimes at i = 2,000. I do not even get to the stage where I initialize my sparse matrix.

Any insights or help would be appreciated.

Update (June 10, 2013)

Thanks to the response provided, I have managed to drastically reduce the committed memory of my code. Below is the updated code, you will notice that it is not exactly the same as in the response to the question, and I will detail what I needed to change.

public SimMatrix(string sparseMethod, string simMethod, Phrases phrases, DistrFeature features, int topK) {
        List<int> rows = new List<int>(phrases.uniquePhraseCount*topK);
        List<int> cols = new List<int>(phrases.uniquePhraseCount*topK);
        List<double> vals = new List<double>(phrases.uniquePhraseCount*topK);
        if (sparseMethod.Equals("invIdx")) {
            for (int i = 0; i < phrases.uniquePhraseCount; i++) { //loop through all phrases
                using (SparseDoubleArray row = phrases.feature_values.GetRowSparse(i)) 
                {
                    if (phrases.feature_values.RowLength(i) > 0) { //i.e., at least one feature fired for phrase                                                                                
                        IEnumerable<int> nonzeros = from pmi in row.Elements select pmi.IndexList[1];
                        IEnumerable<int> neighbors = nonzeros.SelectMany(x => features.inverted_idx[x]).Distinct();                            
                        IEnumerable<double> simVals = neighbors.Select(x => cosineSimilarity(row, x, phrases));
                        var sortedIdxSim = neighbors.Zip(simVals, (a, b) => new { idx = a, sim = b }).OrderByDescending(pair => pair.sim).ToList();
                        //IEnumerable<int> sortedIdx = sortedIdxSim.Select(pair => pair.idx);                                                        
                        //IEnumerable<double> sortedSim = sortedIdxSim.Select(pair => pair.sim);                                                        
                        int sortedIdxSimCount = sortedIdxSim.Count;
                        int topN = (sortedIdxSimCount < topK) ? sortedIdxSimCount : topK;                            
                        rows.AddRange(Enumerable.Repeat(i, topN));
                        cols.AddRange(sortedIdxSim.Take(topN).Select(pair => pair.idx));
                        vals.AddRange(sortedIdxSim.Take(topN).Select(pair => pair.sim)); 
                    }
                    else { //just add self similarity
                        rows.Add(i);
                        cols.Add(i);
                        vals.Add(1);
                    }
                    if ((i % 1000) == 0)
                        Console.WriteLine("{0} phrases done;", i + 1);                         
                }
            }
        }
        else { Console.WriteLine("Sorry, no other sparsification method implemented thus far"); }
        simMat = new SparseDoubleArray(phrases.uniquePhraseCount, phrases.uniquePhraseCount, rows, cols, vals); 
    }

    static public double cosineSimilarity(SparseDoubleArray profile1, int profile2idx, Phrases phrases) {
        using (SparseDoubleArray profile2 = phrases.feature_values.GetRowSparse(profile2idx)) {
            double numerator = profile1.Dot(profile2);
            double norm1 = profile1.Norm();
            double norm2 = profile2.Norm();
            double cos_sim = numerator / (norm1 * norm2);
            if (cos_sim > 0)
                return cos_sim;
            else
                return 0;
        }
    }

Firstly, I was forced to convert the var sortedIdxSim from an IEnumerable to a List. This is because I need to know the number of elements in it, and calling .Count() on the IEnumerable seemed to wipe out the data it held. It also seemed as if calling .Take() on the IEnumerable&lt;int&gt; sortedIdx (e.g., as per the original suggestion by Gjeltema) wiped out the data in IEnumerable&lt;double&gt; sortedSim. Is this due to deferred execution? I'm not too familiar with lazy evaluation/deferred execution, so maybe I'm misunderstanding what I need to do here.

However, to be honest the current changes here have reduced my committed memory considerably such that the program can actually run to completion, so many thanks for that! If someone can help clarify the issue above for me, that would be great.
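(For reference, "deferred execution" means a LINQ query is just a description of work: it is re-run from scratch every time it is enumerated, and .Count(), .Take(), and foreach each trigger a fresh enumeration. The data is never "wiped out", it is recomputed. A minimal standalone sketch, independent of the code above:)

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredExecutionDemo
{
    static void Main()
    {
        int selectorRuns = 0;
        var source = new List<int> { 3, 1, 2 };

        // Building the query executes nothing; it is only a description of work.
        IEnumerable<int> doubled = source.Select(x => { selectorRuns++; return 2 * x; });

        int count = doubled.Count();                   // first enumeration: selector runs 3 times
        List<int> firstTwo = doubled.Take(2).ToList(); // second enumeration: 2 more runs

        Console.WriteLine(count);        // 3
        Console.WriteLine(selectorRuns); // 5: the query ran twice, nothing was "wiped out"

        // The values are recomputed on each pass. If the selector touches a resource
        // that has since been disposed (like a row freed by a `using` block), the
        // second pass is what fails. Calling .ToList() materializes the result once.
    }
}
```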

John Saunders
Avneesh
  • I just added an edit to the bottom of my answer - make sure you check that out and respond back if you're still having memory issues after trying my code changes. – Gjeltema Jun 09 '13 at 04:32
  • It's odd that the `Count()` and `Take()` are "wiping out" the values - they shouldn't be doing that. I've reviewed the code and I don't see why it would at this point - it would just iterate over the collections multiple times, but you should still get the desired values. I'd have to have the full code and debug it to really determine why it's doing that. That said, calling `ToList()` on the `sortedIdxSim` shouldn't hurt, and deferring really didn't help much with memory for it, similar to `neighbors` (I almost just called `ToList()` for it in my answer code). – Gjeltema Jun 11 '13 at 01:36

1 Answer


One problem is that you're declaring a bunch of temporary collections early and initializing them with capacities that look well beyond what they will actually need. You then discard that allocated memory by assigning other values to the same variables. This probably isn't your major issue, since the discarded collections become eligible for garbage collection almost immediately, but I'm sure it's not helping.

For example, you initialize neighbors like so:

List<int> neighbors = new List<int>(phrases.uniquePhraseCount);

Then at the first usage of neighbors, you assign it a new collection, discarding the memory you've allocated for it:

neighbors = generateNeighbors(nonzeros, features.inverted_idx);

So, first thing is you'll want to get rid of all those early initializations you're not using, which probably take up a sizable chunk of memory.
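To make the cost concrete, here's a minimal sketch (the element count is made up for illustration):

```csharp
using System;
using System.Collections.Generic;

class CapacityDemo
{
    static void Main()
    {
        // The capacity constructor allocates the backing array immediately,
        // even though the list is empty:
        var neighbors = new List<int>(1000000); // roughly 4 MB, yet Count == 0
        int preallocated = neighbors.Capacity;
        Console.WriteLine(preallocated); // 1000000

        // Reassigning the variable throws that array away unused; it becomes
        // garbage the GC has to reclaim later.
        neighbors = new List<int> { 1, 2, 3 };
        Console.WriteLine(neighbors.Capacity); // small again (4 on current runtimes)
    }
}
```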

Next, you're using LINQ statements a lot, presumably for readability and ease of getting the data you want, which is great. But by calling .ToList() on everything you're giving up one of LINQ's biggest advantages in low-memory situations: deferred (lazy) execution.

I've gone over your function and removed the initializations I mentioned above, and also changed as much as reasonable to be lazily evaluated (i.e., removed .ToList() calls).

(Note that I left the .ToList() call for the neighbors initialization, as I don't think you'll gain much by removing it; it's hard to tell from this code how large neighbors will be. If you still have memory problems, though, I'd suggest changing the return type of generateNeighbors() to IEnumerable&lt;int&gt;, removing the .ToList() inside it, and trying that.)

This should reduce your peak memory usage considerably. If you still have memory problems, come back and update your question - I'll probably need to see more code and get more information about what sort of numbers you're running this with at that point.

// Side note - your simMethod argument doesn't seem to be used.
public SimMatrix(string sparseMethod, string simMethod, Phrases phrases, DistrFeature features, int topK)
{
    List<int> rows = new List<int>(phrases.uniquePhraseCount * topK);
    List<int> cols = new List<int>(phrases.uniquePhraseCount * topK);
    List<double> vals = new List<double>(phrases.uniquePhraseCount * topK);
    if (sparseMethod.Equals("invIdx"))
    {
        for (int i = 0; i < phrases.uniquePhraseCount; i++)
        { //loop through all phrases
            using (SparseDoubleArray row = phrases.feature_values.GetRowSparse(i))
            {
                if (phrases.feature_values.RowLength(i) > 0)
                { //i.e., at least one feature fired for phrase
                    // Declare your temporary collections when they're initialized
                    IEnumerable<int> nonzeros = row.Elements.Select(pmi => pmi.IndexList[1]);
                    var neighbors = generateNeighbors(nonzeros, features.inverted_idx);
                    IEnumerable<double> simVals = neighbors.Select(x => cosineSimilarity(row, phrases.feature_values.GetRowSparse(x)));
                    var sortedIdxSim = neighbors.Zip(simVals, (a, b) => new { idx = a, sim = b }).OrderByDescending(pair => pair.sim);
                    IEnumerable<int> sortedIdx = sortedIdxSim.Select(pair => pair.idx);
                    IEnumerable<double> sortedSim = sortedIdxSim.Select(pair => pair.sim);
                    int sortedIdxSimCount = sortedIdxSim.Count();
                    int topN = (sortedIdxSimCount < topK) ? sortedIdxSimCount : topK;
                    rows.AddRange(Enumerable.Repeat(i, topN));
                    cols.AddRange(sortedIdx.Take(topN));
                    vals.AddRange(sortedSim.Take(topN));
                }
                else
                { //just add self similarity
                    rows.Add(i);
                    cols.Add(i);
                    vals.Add(1);
                }
                Console.WriteLine("{0} phrases done", i + 1);
            }
        }
    }
    else { Console.WriteLine("Sorry, no other sparsification method implemented thus far"); }
    simMat = new SparseDoubleArray(phrases.uniquePhraseCount, phrases.uniquePhraseCount, rows, cols, vals);
}

static private List<int> generateNeighbors(IEnumerable<int> idx, Dictionary<int, List<int>> inverted_idx)
{
    // Doing it this way will reduce memory usage since you won't be creating a bunch of temporary
    // collections, adding them to an existing collection, then creating a brand new collection
    // from it that is smaller... I think that may have been spiking your memory usage quite a bit.
    return inverted_idx.Where(x => idx.Contains(x.Key)).SelectMany(x => x.Value).Distinct().ToList();
}

One final note - you seem to be adding what look to be the same values to at least rows, and possibly cols and vals. Do you need duplicate values in these collections (it looks like you may, actually)? This isn't really an issue as long as they never exceed their initialized capacity (phrases.uniquePhraseCount * topK), but if they do, then that may be your main memory problem.
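For what it's worth, here's what happens when a List&lt;T&gt; outgrows its capacity (the growth factor is an implementation detail, but doubling is the current behavior):

```csharp
using System;
using System.Collections.Generic;

class GrowthDemo
{
    static void Main()
    {
        var rows = new List<int>(4);
        for (int i = 0; i < 5; i++)
            rows.Add(i); // the 5th Add outgrows the capacity of 4

        // The list grows by allocating a new, larger backing array and copying
        // the old elements across; the old array becomes garbage.
        Console.WriteLine(rows.Capacity); // 8
    }
}
```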

Edit: I just noticed something else. What is the SparseDoubleArray class, and what does GetRowSparse() do?

Specifically I'm wondering why you're doing a using on the class, like so:

using (SparseDoubleArray row = phrases.feature_values.GetRowSparse(i)) 

which forces it to release its native resources after the block is done. BUT, you're also calling it here:

simVals.Add(cosineSimilarity(row, phrases.feature_values.GetRowSparse(neighbor)));

but not calling Dispose() on it. What's going on in that function and that class? Is the using not needed, or is it actually needed? If it's needed, then that may be your memory leak.
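For context, a using block is just syntactic sugar for a try/finally that calls Dispose(). A self-contained sketch, with a hypothetical TracingResource standing in for an IDisposable type like SparseDoubleArray:

```csharp
using System;

// Hypothetical stand-in for a disposable type such as SparseDoubleArray.
class TracingResource : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() { Disposed = true; }
}

class UsingDemo
{
    static void Main()
    {
        var row = new TracingResource();
        using (row)
        {
            // work with row
        }
        // The using block compiles to roughly:
        //   try { /* work */ } finally { if (row != null) row.Dispose(); }
        Console.WriteLine(row.Disposed); // True

        // By contrast, an object created inline as an argument, like the
        // GetRowSparse(neighbor) call above, sits inside no using/finally,
        // so nothing ever calls Dispose() on it.
        var leaked = new TracingResource();
        Console.WriteLine(leaked.Disposed); // False
    }
}
```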

Gjeltema
  • Thanks for your help on this, Gjeltema. Using a combination of what you had suggested and some additional pieces, I was able to get this to work without significant committed memory. Just for completion, let me address your queries, and I will also update my question above. 1. Yes, I do need duplicate values in the collections, that's the way the sparse matrices are initialized. I am sure they will never exceed their initialized capacity, because at each iteration of the for loop I add at most topK entries. 2. You're right about the second "using" statement - I have edited my code (see above). – Avneesh Jun 10 '13 at 18:45
  • Response to side note: you're right, I just put it there because I'll need to use it in the next step that I implement. – Avneesh Jun 10 '13 at 18:54
  • By the way, I don't know if you can tell but I basically come from a more C/C++ background; could you point me to any resources that I could read up on to avoid such mistakes in the future and make my code more "C#ish"? – Avneesh Jun 11 '13 at 03:25
  • @Avneesh My apologies - I read this at work and completely forgot to get back to it after I got home. I come from a C/C++ background as well. For the two main issues in your question, you can read more about `using` and `IDisposable` starting here (http://msdn.microsoft.com/en-us/library/system.idisposable.aspx) and can also search for both and find quite a bit about it elsewhere. Regarding Linq/lambdas and deferred execution, that's a big subject. I can point you in the direction to get started, but it will take some doing to get used to it. – Gjeltema Jun 16 '13 at 14:42
  • You seem to have Linq usage down pretty well, so for deferred execution, I'd start by going here (http://stackoverflow.com/questions/410026/proper-use-of-yield-return) and going to Jon Skeet's answer and check out those articles. If you're still fuzzy after that, try searching "C# yield", and read some articles there. Feel free to hit me up (comment here if you desire), or of course post a question on SO, if you have further questions. – Gjeltema Jun 16 '13 at 14:44