So, I have an app, written in C# (VS2010), performing OCR using the Tesseract 3.02 DLL and Charles Weld's Tesseract .NET wrapper.

I think I have a memory leak, and it seems to be in the area of code where the Pix object is allocated. I am taking a PDF, converting it to a grayscale PNG, then loading that into a Pix object for OCR. When it works, it works really well. The image has large dimensions (5100 or so pixels each way) but a small file size (only 500k or so).

My code:

Init engine at app startup:

private TesseractEngine engine = new TesseractEngine(@"./tessdata/", "eng+fra", EngineMode.Default);

Method to convert PDF to PNG, then calls:

// Load the image file created earlier into a Pix object.
Pix pixImage = Pix.LoadFromFile(Path.Combine(textBoxSourceFolder.Text, sourceFile));

And then calls the following:

// Perform OCR on the image referenced in the Pix object.
private String PerformImageOCR(Pix pixImage)
{
    int safety = 0;

    do
    {
        try
        {
            // Deskew the image.
            pixImage = pixImage.Deskew();
            //pixImage.Save(@"c:\temp\img_deskewed.png", ImageFormat.Png); // Debugging - verify image deskewed properly to allow good OCR.

            string text = "";

            // Use the tesseract OCR engine to process the image
            using (var page = engine.Process(pixImage))
            {
                // and then extract the text.
                text = page.GetText();
            }

            return text;
        }
        catch (Exception e)
        {
            MessageBox.Show(string.Format("There was an error performing OCR on image, Retrying.\n\nError:\n{0}", e.Message), "Error", MessageBoxButtons.OK);
        }
    } while (++safety < 3);

    return string.Empty;
}

I have observed that memory usage jumps by about 31MB when the Pix object is created, jumps again while OCR is being performed, then finally settles about 33MB higher than before it started. For example, if the app was consuming 50MB after loading, creating the Pix object takes it to about 81MB, performing OCR spikes it to 114MB or more, and once the process is complete and the results are saved, usage settles at about 84MB. Repeating this over many files in a folder will eventually cause the app to barf at 1.5GB or so consumed.

I think my code is okay, but there's something somewhere that's holding onto resources.

The Tesseract and Leptonica DLLs are written in C, and I have recompiled them with VS2010 along with the latest or recommended image library versions, as appropriate. What I'm unsure of is how to diagnose a memory leak in a C DLL from a C# app using Visual Studio. If I were using Linux, I'd use a tool such as valgrind to help me spot the leak, but my leak-sniffing skills on the Windows side are sadly lacking. Looking for advice on how to proceed.

Yu Hao
Jon
  • I'm not familiar with this library, but does `Pix` implement `IDisposable`, and if so, are you calling `Dispose()` on your `pixImage` objects? That's the only thing I see. Of course, the leak could be in the library itself. You may need to contact the developers. Also, [.NET memory profilers](http://stackoverflow.com/questions/3927/what-are-some-good-net-profilers) are pretty good at helping to identify leaks in managed code (not so good at unmanaged leaks though). – TypeIA Mar 24 '14 at 14:14
  • So, which bit allocates unmanaged resources, is it disposed or terminated? – Jodrell Mar 24 '14 at 14:14
  • Thanks @dnvrrs, upvoted your comment as being helpful! – Jon Mar 24 '14 at 16:01

2 Answers

I'm not familiar with Tesseract or the wrapper, but for memory profiling issues, if you have Visual Studio 2012/2013, you can use the Performance Wizard. I know it's available in Ultimate, but I'm not sure about other editions.

http://blogs.msdn.com/b/dotnet/archive/2013/04/04/net-memory-allocation-profiling-with-visual-studio-2012.aspx

Either something in your code or something in the wrapper is not disposing an unmanaged object properly. My guess would be that it's in the wrapper. Running the Performance Wizard or another C# memory profiler (like JetBrains dotTrace) may help you track it down.
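Before reaching for a profiler, a rough first check is to log both the managed heap size and the process's private bytes after each file. The sketch below is only an illustration; `MemoryProbe` and its method names are hypothetical, and it uses only `GC.GetTotalMemory` and `Process.PrivateMemorySize64` from the framework. If private bytes keep climbing per file while the managed figure stays roughly flat, that would suggest the growth is in native allocations rather than on the managed heap.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for a quick managed-vs-native memory comparison.
public static class MemoryProbe
{
    // Managed heap size in bytes, measured after forcing a full collection.
    public static long ManagedBytes()
    {
        return GC.GetTotalMemory(true);
    }

    // Private bytes of the whole process (managed and native allocations).
    public static long PrivateBytes()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            current.Refresh();
            return current.PrivateMemorySize64;
        }
    }

    // Writes one line per checkpoint so the figures can be compared per file.
    public static void Report(string label)
    {
        Console.WriteLine("{0}: managed={1:N0} B, private={2:N0} B",
            label, ManagedBytes(), PrivateBytes());
    }
}
```

Calling `MemoryProbe.Report(fileName)` after each processed file gives a simple series to compare; it does not say where a leak is, only on which side of the managed/native boundary to look.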

Joe the Coder

Reading your code, I do not see you disposing your Pix pixImage anywhere. That is what takes up the resources as you process more and more images. Before you return your string result, you should call the Dispose method on your pixImage. That should reduce the amount of memory used by your program.
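One way to apply this is with `using` blocks, which call `Dispose()` even when an exception is thrown. The following is a minimal sketch, assuming the Tesseract .NET wrapper from the question is referenced; the `OcrRunner` class and its method name are hypothetical. Note that `Deskew()` returns a `Pix` of its own, so the sketch keeps the loaded image and the deskewed image in separate variables and disposes both.

```csharp
using System;
using Tesseract; // Charles Weld's .NET wrapper, as used in the question

// Hypothetical wrapper class; only the disposal pattern matters here.
public sealed class OcrRunner : IDisposable
{
    private readonly TesseractEngine engine =
        new TesseractEngine(@"./tessdata/", "eng+fra", EngineMode.Default);

    public string PerformImageOcr(string imagePath)
    {
        // Each using block releases its object when the block exits,
        // in reverse order: page, then deskewed, then source.
        using (Pix source = Pix.LoadFromFile(imagePath))
        using (Pix deskewed = source.Deskew())
        using (Page page = engine.Process(deskewed))
        {
            return page.GetText();
        }
    }

    public void Dispose()
    {
        engine.Dispose();
    }
}
```

Keeping the two `Pix` variables separate avoids overwriting the only reference to the loaded image, which would otherwise leave it with no caller able to dispose it.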

woutervs
  • Pix does implement IDisposable. I tried the suggestion of calling Dispose when I am finished, but this made no change to the memory usage. – Jon Mar 24 '14 at 15:31
  • Then I'd dare say that the implementation of the API is faulty. As I reckon, Pix image holds unmanaged resources. I would post your question on their forum (if they have one) or mail them directly for support. – woutervs Mar 24 '14 at 15:41
  • Aha... got it! The Deskew() method increases the refcount of the pix object in the dll. Calling Dispose() decreases the refcount by 1 each time it is called and only frees resources when refcount=0. So, thanks to you and to @dvnrrs (dunno who replied first with the same suggestion) for pointing me in the direction I needed to look. Question though... I thought we should not call the Dispose method ourselves? I've read numerous arguments "out there" on this and am still not clear if one should or should not. Clearly, in this case, I must, but... what's the "rule of thumb"? – Jon Mar 24 '14 at 16:01
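
The refcount behaviour described in the last comment can be illustrated with a simplified model. Everything in the sketch below is hypothetical (`NativeBuffer`, `CountedHandle`, `RefCountDemo`); it is not the wrapper's or Leptonica's actual implementation, only a stand-in that assumes two managed wrappers can share one reference-counted native buffer, with the buffer freed when the count reaches zero.

```csharp
using System;

// Hypothetical stand-in for a native, reference-counted image buffer.
public sealed class NativeBuffer
{
    public int RefCount;
    public bool Freed;
}

// Hypothetical stand-in for a managed wrapper such as Pix.
public sealed class CountedHandle : IDisposable
{
    private readonly NativeBuffer buffer;
    private bool disposed;

    public CountedHandle(NativeBuffer buffer)
    {
        this.buffer = buffer;
        buffer.RefCount++;
    }

    // Returns a second wrapper that shares the same native buffer.
    public CountedHandle Clone()
    {
        return new CountedHandle(buffer);
    }

    // Drops this wrapper's reference; the buffer is freed at zero.
    public void Dispose()
    {
        if (disposed) return;
        disposed = true;
        if (--buffer.RefCount == 0) buffer.Freed = true;
    }
}

public static class RefCountDemo
{
    // Overwrites the only reference to the first wrapper,
    // mirroring pixImage = pixImage.Deskew() in the question.
    public static bool LeakyPattern()
    {
        NativeBuffer native = new NativeBuffer();
        CountedHandle handle = new CountedHandle(native);
        handle = handle.Clone(); // first wrapper is lost, never disposed
        handle.Dispose();
        return native.Freed;     // false: one reference is still held
    }

    // Disposes every wrapper that was created.
    public static bool SafePattern()
    {
        NativeBuffer native = new NativeBuffer();
        using (CountedHandle original = new CountedHandle(native))
        using (CountedHandle copy = original.Clone())
        {
            // work with copy here
        }
        return native.Freed;     // true: both references were released
    }
}
```

As for the rule of thumb: the common guidance is that whoever creates or receives ownership of an `IDisposable` object is responsible for disposing it, usually through a `using` block, rather than waiting for a finalizer to run at some unspecified time.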