2

I have a method that converts PDF text into a list. After it runs, memory usage increases too much. For example, a 1000-page PDF uses 300 MB of memory and I can't free it. I have read some articles about the LOH (large object heap) but have not found a solution.

    public List<string> GetTextFromPdf()
    {
        if (_pdfDoc.Pages == null) return null;
        List<string> ocrList = new List<string>();

        // Collect every word on every page, stripping non-word characters.
        foreach (var words in _pdfDoc.Pages.Select(s => s.Value.WordList))
        {
            ocrList.AddRange(words.Select(word => word.Word).Select(input => Regex.Replace(input, @"[\W]", "")));
        }

        // Attempt to force a collection before returning; memory usage stays high regardless.
        GC.Collect();
        return ocrList;
    }
Orhan Cinar

3 Answers

5

This is about normal for a 100 megabyte .pdf. You load the entire thing into memory, and that takes double the amount of memory since a character in .NET takes 2 bytes. You will also create a bunch of garbage in the large object heap as the list grows. Add the typical .NET runtime overhead and 300 megabytes is not an unexpected result.

Check this answer for details on how using the List<>.Capacity property can help reduce the LOH demands.
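As a rough illustration of the capacity idea, here is a minimal sketch that sizes the list once up front so it never has to grow and leave discarded backing arrays behind. The `PresizedListSketch` class and its `pages` parameter are hypothetical stand-ins for the question's `_pdfDoc.Pages` word lists, not part of any PDF library.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class PresizedListSketch
{
    // "pages" is a hypothetical stand-in: one list of raw words per page.
    public static List<string> CollectWords(IReadOnlyList<IReadOnlyList<string>> pages)
    {
        // Count the words first so the backing array is allocated exactly once.
        int totalWords = pages.Sum(page => page.Count);
        var result = new List<string>(totalWords);

        foreach (var page in pages)
        {
            foreach (var word in page)
            {
                // Same cleanup as in the question: strip non-word characters.
                result.Add(Regex.Replace(word, @"\W", ""));
            }
        }

        return result;
    }
}
```

If the exact count is expensive to compute, an overestimate passed to the constructor should have a similar effect, at the cost of some unused slots.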

Hans Passant
  • I'm curious about why you think PDFs aren't already in Unicode? – EricLaw Jun 19 '11 at 14:44
  • @Eric - PDF has been around too long to benefit from Unicode standardization. It has the typical 8-bit encoding zoo, "WinAnsi" is one of them. – Hans Passant Jun 19 '11 at 15:08
  • I cleared the list and set its capacity to 0, but the memory usage is still the same – Orhan Cinar Jun 19 '11 at 15:10
  • Yes, it is rare for the Windows memory manager to release virtual memory. The odds that the released memory exactly matches a memory mapping are very low. Not a problem, it is virtual. Minimize your main window to make yourself feel better. The linked answer tried to explain how to reduce LOH usage by not allocating it in the first place. – Hans Passant Jun 19 '11 at 15:19
  • @Orhan : It's not (just) about clearing the list but about allocating it wisely. Like `ocrList = new List<string>(_pdfDoc.Pages.Count * OverEstimateWordsPerPage);` – H H Jun 19 '11 at 15:42
0

Check whether your PDF loader is still referenced somewhere; if it is, it cannot be disposed.

Piotr Auguscik
  • The problem appears after adding the words to my ocrList. When I call _pdfDoc.Dispose() the app crashes, because the PDF is still inside a viewer. – Orhan Cinar Jun 19 '11 at 13:44
0

Is your PDF library COM-based? You may need to call Marshal.ReleaseComObject on some of your references when you have finished with them.
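A minimal sketch of what that could look like, assuming the library really does hand out COM objects. The `ComCleanupSketch` helper is hypothetical; only `Marshal.IsComObject` and `Marshal.ReleaseComObject` are actual framework calls.

```csharp
using System.Runtime.InteropServices;

static class ComCleanupSketch
{
    // Releases one reference on the object's runtime callable wrapper.
    // Does nothing for null or for ordinary managed objects.
    public static void Release(object maybeComObject)
    {
        if (maybeComObject != null && Marshal.IsComObject(maybeComObject))
        {
            Marshal.ReleaseComObject(maybeComObject);
        }
    }
}
```

Only release references that nothing else is still using. If a viewer control still holds the same document, releasing it early is likely to cause errors.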

Bob Vale