
I am working with a 32-bit console application that operates as a background processor. The part I am working on uses Ghostscript to perform OCR on PDFs: each page of a PDF is rendered to a PNG image in a temp folder, which the OCR reader then reads. The OCR text is saved to a database and the files in the temp folder are then deleted.

The problem is that the GhostscriptRasterizer object eats all of the memory the processor has available. When I call the GhostscriptRasterizer.GetPage(dpi, dpi, pageNumber) method, I get either an OutOfMemoryException or a System.ArgumentException with the message "Parameter is not valid". My research on the second exception tells me it is really a symptom of the first: the method call eats all of the available memory.

The GetPage method is creating a System.Drawing.Bitmap image which requires contiguous unfragmented memory. The problem code begins here.

try
{
    img = rasterizer.GetPage(dpi, dpi, pageNumber);
}
catch (OutOfMemoryException)
{
    // Retry at a lower dpi.
    img = GetImage(rasterizer, dpi, pageNumber, ms);
}
catch (System.ArgumentException)
{
    // "Parameter is not valid" is handled as another out-of-memory symptom.
    img = GetImage(rasterizer, dpi, pageNumber, ms);
}

The GetImage method I wrote looks like this.

public Image GetImage(GhostscriptRasterizer rasterizer, int dpi, int pageNumber, MemoryStream ms)
{
    // Discard the rasterizer that failed and open a fresh one on the same stream.
    rasterizer.Close();
    rasterizer.Dispose();
    rasterizer = new GhostscriptRasterizer();
    rasterizer.Open(ms);

    // Step the resolution down by 50 dpi per attempt.
    dpi = dpi - 50;
    Image image = null;
    if (dpi > 0)
    {
        try
        {
            image = rasterizer.GetPage(dpi, dpi, pageNumber);
        }
        catch (OutOfMemoryException)
        {
            image = GetImage(rasterizer, dpi, pageNumber, ms);
        }
        catch (System.ArgumentException)
        {
            image = GetImage(rasterizer, dpi, pageNumber, ms);
        }
    }

    // Null once the dpi has dropped to 0 or below without a successful render.
    return image;
}

The dpi I start with is 300, and it has worked for 95% of the documents we have run through our first test of this system. However, for certain pages 300 dpi is clearly too high, as I get the OutOfMemoryException. It looks like some of the pages are about 35 x 59 inches. I have no control over this. The solution for me is to keep trying at a lower and lower dpi until I have something that doesn't eat all of the memory. However, all of that memory remains in the rasterizer object, so I need to dispose of it somehow. Calling rasterizer.Close() gives me the following error.

Managed Debugging Assistant 'FatalExecutionEngineError' has detected a problem in 'F:\Development\bin\Debug\Processor.Run.vshost.exe'.

Additional information: The runtime has encountered a fatal error. The address of the error was at 0x7331e8c6, on thread 0x3e90. The error code is 0xc0000005. This error may be a bug in the CLR or in the unsafe or non-verifiable portions of user code. Common sources of this bug include user marshaling errors for COM-interop or PInvoke, which may corrupt the stack.

Removing the Close() call and calling rasterizer.Dispose() gives me:

An unhandled exception of type 'System.AccessViolationException' occurred in Ghostscript.NET.dll

Additional information: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

I even tried simply breaking out when I hit an exception and returning the file list. That still required me to avoid a using statement for the rasterizer, because I got the same exception at the end of the using block when it tried to dispose of the object. It appears the garbage collector reclaimed that memory later on, but that does not solve my problem: I still have no way of rasterizing the page within the same job.

The only solution I can think of is somehow resizing the PDF ahead of time, but I'm hoping someone knows a way of disposing of that memory and re-rasterizing at a new, lower dpi.
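
To put numbers on the page size mentioned above: an uncompressed bitmap of a 35 x 59 inch page at 300 dpi is 10,500 x 17,700 pixels, which is about 743 MB at 4 bytes per pixel. That is a very large single allocation for a 32-bit process. The following is a rough sketch of choosing a dpi from a memory budget instead of stepping down by 50 each time. The class and method names, the 4 bytes per pixel, and the budget value are assumptions for illustration, and the page size in inches would have to be read from the PDF by some other means.

```csharp
using System;

static class RasterBudget
{
    // Bytes needed for an uncompressed bitmap of a page at the given dpi.
    public static long BitmapBytes(double widthInches, double heightInches,
                                   int dpi, int bytesPerPixel)
    {
        long widthPx = (long)Math.Ceiling(widthInches * dpi);
        long heightPx = (long)Math.Ceiling(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }

    // Largest dpi, capped at maxDpi, whose bitmap fits within budgetBytes.
    // Returns 0 when no positive dpi fits.
    public static int MaxDpiWithinBudget(double widthInches, double heightInches,
                                         int maxDpi, long budgetBytes, int bytesPerPixel)
    {
        if (widthInches <= 0 || heightInches <= 0 || bytesPerPixel <= 0)
            throw new ArgumentOutOfRangeException("Page size and bytesPerPixel must be positive.");
        if (budgetBytes <= 0 || maxDpi <= 0)
            return 0;

        // bytes = width * height * dpi^2 * bytesPerPixel, solved for dpi.
        double estimate = Math.Sqrt(budgetBytes / (widthInches * heightInches * bytesPerPixel));
        int dpi = (int)Math.Min(maxDpi, Math.Floor(estimate));

        // Pixel counts are rounded up, so step down if the estimate overshoots.
        while (dpi > 0 && BitmapBytes(widthInches, heightInches, dpi, bytesPerPixel) > budgetBytes)
            dpi--;
        return dpi;
    }
}
```

For example, RasterBudget.MaxDpiWithinBudget(35, 59, 300, 256L * 1024 * 1024, 4) gives the highest dpi at which the 35 x 59 inch page stays under a hypothetical 256 MB budget, while a letter-sized page with the same budget stays at the 300 dpi cap.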

Aaron

3 Answers

1

You can write PostScript that alters the media size when a PDF requests a large media. But that would require some PostScript programming knowledge.

I believe the actual problem is not Ghostscript, however, because when exceeding memory limits Ghostscript will switch to a display list model where it outputs the page to disk in bands (running the display list as many times as there are bands to output). Provided you actually have a disk, which you clearly do, and there's enough memory for one raster line, then it will (eventually in the case of one band per line) output the whole thing.

Which suggests to me the actual problem is with the C++ or C# wrapper you are using, not Ghostscript itself.

I suspect that your wrapper is trying to create a huge bitmap in memory to hold the rendered output before writing it to disk. That isn't required.

Try running Ghostscript directly from the command line with one of your failing files. If that works, then you can simply use Ghostscript; it's perfectly capable of producing a PNG file as output. For what it's worth, I have used Ghostscript to output media of that size, and larger, at 600 dpi.
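
If the command line test works, the same invocation can be driven from C# without the wrapper. This is only a sketch: the class and method names are made up, the path to the Ghostscript console executable (gswin32c.exe on 32-bit Windows installs) is passed in by the caller, and error handling is reduced to returning the exit code. The switches used are standard Ghostscript options: png16m is the 24-bit PNG device, -r sets the resolution, and -dFirstPage/-dLastPage select a single page.

```csharp
using System;
using System.Diagnostics;

static class GhostscriptCli
{
    // Builds the argument string for rendering one PDF page to a PNG file.
    public static string BuildArguments(string pdfPath, string pngPath, int pageNumber, int dpi)
    {
        return string.Format(
            "-dBATCH -dNOPAUSE -q -sDEVICE=png16m -r{0} -dFirstPage={1} -dLastPage={1} \"-sOutputFile={2}\" \"{3}\"",
            dpi, pageNumber, pngPath, pdfPath);
    }

    // Runs Ghostscript as a separate process and returns its exit code (0 on success).
    // The rendered page is written straight to disk, so no Bitmap is created in this process.
    public static int RenderPage(string gsExePath, string pdfPath, string pngPath, int pageNumber, int dpi)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = gsExePath,
            Arguments = BuildArguments(pdfPath, pngPath, pageNumber, dpi),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

Because Ghostscript runs in its own process here, a failed render cannot corrupt the memory of the calling application, and the call can be retried at a lower dpi.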

KenS
  • I imagine that you are correct that GhostScript.NET is the issue here. However the processor that is running this operates as the background processor for a website and performs a variety of jobs. I need the .NET wrapper so I can use GhostScript within the system to service OCR jobs on the fly as documents are uploaded. – Aaron Oct 28 '15 at 14:30
  • Then you'll have to take it up with jhabjan, the maintainer behind Ghostscript.NET; I can't help you with that. By the way, this is an internal system, right? – KenS Oct 28 '15 at 14:32
  • @Aaron I'm having a problem similar to yours. Did you resolve this while still using GhostScript.NET? – Scotty H Nov 03 '15 at 20:22
  • @Aaron If there's any way you can switch to a 64-bit application, I've been having good success with the 64-bit version of Ghostscript. No out-of-memory exceptions so far. – Scotty H Nov 04 '15 at 14:49
0

I have a similar issue: I get the "Attempted to read or write protected memory" error when disposing after an exception occurs. This happens when I am trying to convert a password-protected PDF. Even after catching the exception, the above access violation occurs and crashes the program.

The solution I used:

I am also using iTextSharp in my program. So I wrote a method using iTextSharp to check if the PDF file is password protected first, using help from this thread: https://stackoverflow.com/questions/11298651/checking-if-pdf-is-password-protected-using-itextsharp#=

So now I am checking for the problem before I run into it. It's the only way I've found around this problem - I don't think the Ghostscript.NET wrapper is being updated or maintained any more.
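
A check along those lines might look like the sketch below. It assumes iTextSharp 5.x, where opening a file that needs a user password throws BadPasswordException and IsOpenedWithFullPermissions reports whether an owner password restricts the document. The class and method names are made up for illustration, and this is not necessarily the exact code from the linked thread.

```csharp
using iTextSharp.text.exceptions;
using iTextSharp.text.pdf;

static class PdfPasswordCheck
{
    // Returns true when the PDF cannot be opened without a password,
    // or when it opens but with restricted permissions.
    public static bool IsPasswordProtected(string pdfPath)
    {
        try
        {
            var reader = new PdfReader(pdfPath);
            try
            {
                return !reader.IsOpenedWithFullPermissions;
            }
            finally
            {
                reader.Close();
            }
        }
        catch (BadPasswordException)
        {
            // A user password is required just to open the document.
            return true;
        }
    }
}
```

Files for which this returns true can be skipped or routed elsewhere before they ever reach the rasterizer.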

Adam Elders
0

I applied the HandleProcessCorruptedStateExceptions and SecurityCritical attributes to the method that calls the Ghostscript method.

This fixed the issue for me. I no longer get this exception.
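
A minimal sketch of that approach, assuming .NET Framework 4.x (where these attributes let a catch block see corrupted state exceptions such as AccessViolationException). The class and method names are hypothetical, and the Ghostscript call itself is passed in as a delegate rather than shown.

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Security;

static class GuardedCall
{
    // Runs the supplied action and reports any exception instead of letting it escape.
    // With these attributes on .NET Framework 4.x, the catch block also receives
    // corrupted state exceptions such as AccessViolationException.
    [HandleProcessCorruptedStateExceptions]
    [SecurityCritical]
    public static bool TryRun(Action nativeCall, out Exception error)
    {
        try
        {
            nativeCall();
            error = null;
            return true;
        }
        catch (Exception ex)
        {
            error = ex;
            return false;
        }
    }
}
```

A call such as GuardedCall.TryRun(() => rasterizer.Dispose(), out error) would then return false instead of crashing. Since an access violation means native memory may already be damaged, it is probably safest to treat a false result as a reason to abandon that rasterizer rather than keep using it.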

Anshul Goyal