C# Memory Usage Problem

Question

I have a method that converts PDF text into a list. After it runs, memory usage increases too much. For example, a 1000-page PDF uses 300 MB of memory and I can't free it. I have read some articles about the LOH but have not found a solution.

    public List<string> GetTextFromPdf()
    {
        if (_pdfDoc.Pages == null) return null;
        List<string> ocrList = new List<string>();

        foreach (var words in _pdfDoc.Pages.Select(s => s.Value.WordList))
        {
            ocrList.AddRange(words.Select(word => word.Word).Select(input => Regex.Replace(input, @"[\W]", "")));
        }

        GC.Collect();
        return ocrList;
    }

Don't re-parse the regex every time; use a shared `Regex` instance. – SLaks, Jun 19 '11 at 13:35
@SLaks What do you mean by not parsing the regex every time? Can you give me an example? – Orhan Cinar, Jun 19 '11 at 13:40
@Henk Holterman The PDF file is 100 MB. After opening the PDF, memory increases by only 10 MB. The problem is in the parsing step. I watch memory usage in Process Explorer. – Orhan Cinar, Jun 19 '11 at 13:47
http://blogs.msdn.com/b/bclteam/archive/2010/06/25/optimizing-regular-expression-performance-part-i-working-with-the-regex-class-and-regex-objects.aspx – SLaks, Jun 19 '11 at 13:58
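
To illustrate the shared `Regex` instance suggested in the comments above, here is a minimal sketch. The class and member names (`WordCleaner`, `NonWordChars`, `Clean`, `CleanAll`) are made up for this example and do not come from the question. It constructs the pattern once and reuses it for every word. This mainly saves per-call work and is not claimed to change the memory figures discussed in this thread.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not part of the asker's code.
public static class WordCleaner
{
    // Constructed once and shared by every call. Regex instances are
    // safe to use from multiple threads for matching and replacing.
    private static readonly Regex NonWordChars =
        new Regex(@"\W", RegexOptions.Compiled);

    // Removes every character that is not a letter, digit, or underscore.
    public static string Clean(string input)
    {
        return NonWordChars.Replace(input, "");
    }

    // Applies Clean to each word and materializes the result.
    public static List<string> CleanAll(IEnumerable<string> words)
    {
        return words.Select(Clean).ToList();
    }
}
```

In the method from the question, the inner projection would then become something like `words.Select(word => WordCleaner.Clean(word.Word))`.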

score 5 · Accepted Answer · edited May 23 '17 at 10:08

This is about normal for a 100 megabyte .pdf. You load the entire thing into memory, and that takes double the space because a character in .NET takes 2 bytes. You also create a good deal of garbage in the large object heap as the list grows. Add the typical .NET runtime overhead, and 300 megabytes is not an unexpected result.

Check this answer for details on how using the List<>.Capacity property can help reduce the LOH demands.

answered Jun 19 '11 at 14:14

Hans Passant

I'm curious about why you think PDFs aren't already in Unicode? – EricLaw Jun 19 '11 at 14:44
@Eric - PDF has been around too long to benefit from Unicode standardization. It has the typical 8-bit encoding zoo, "WinAnsi" is one of them. – Hans Passant Jun 19 '11 at 15:08
I cleared the list and set its capacity to 0, but the memory usage is still the same. – Orhan Cinar Jun 19 '11 at 15:10
Yes, it is rare for the Windows memory manager to release virtual memory. The odds that the released memory exactly matches a memory mapping are very low. Not a problem, it is virtual. Minimize your main window to make yourself feel better. The linked answer tried to explain how to reduce LOH usage by not allocating it in the first place. – Hans Passant Jun 19 '11 at 15:19
@Orhan : It's not (just) about clearing the list but about allocating it wisely. Like `ocrList = new List<string>(_pdfDoc.Pages.Count * OverEstimateWordsPerPage);` – H H Jun 19 '11 at 15:42
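
The pre-sizing idea from the last comment can be sketched as follows. This is only an illustration under stated assumptions: `OcrListBuilder`, `Build`, and the value of `OverEstimateWordsPerPage` are hypothetical, and pages are modeled as plain lists of strings because the PDF library's types are not shown in the question. A `List<T>` that starts empty grows by allocating a larger backing array and copying into it. Once an array exceeds roughly 85,000 bytes it is allocated on the large object heap, so each further growth step leaves a dead LOH array behind. Reserving the capacity up front avoids those intermediate arrays.

```csharp
using System.Collections.Generic;

// Hypothetical helper; names and the estimate are assumptions for illustration.
public static class OcrListBuilder
{
    // Deliberate over-estimate of words per page; tune it for your documents.
    public const int OverEstimateWordsPerPage = 500;

    public static List<string> Build(IReadOnlyList<IReadOnlyList<string>> pages)
    {
        // Allocate the backing array once, instead of letting the list
        // reallocate and copy repeatedly as words are added.
        var result = new List<string>(pages.Count * OverEstimateWordsPerPage);

        foreach (var page in pages)
        {
            result.AddRange(page);
        }

        return result;
    }
}
```

If the estimate is too low, the list still works; it simply falls back to growing as usual for the remainder.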

score 0 · Answer 2 · answered Jun 19 '11 at 13:37

Check whether your PDF loader is still referenced somewhere, so that it cannot be disposed.

Piotr Auguscik

The problem appears after adding the words to my ocrList. When I call _pdfDoc.Dispose(), the app crashes because the PDF is still open in a viewer. – Orhan Cinar Jun 19 '11 at 13:44

score 0 · Answer 3 · answered Jun 19 '11 at 13:38

Is your PDF library COM-based? You may need to call `Marshal.ReleaseComObject` on some of your references when you have finished with them.

Bob Vale
