3

Can I use the Ghostscript API to convert a PDF to another format without reading the input from disk or writing the results to disk? The disk I/O adds a lot of overhead!

I need something like this:

public static byte[][] ConvertPDF(byte[] pdfData)
{
    // Returns one byte array per page of the converted output.
    throw new System.NotImplementedException();
}
Saw

2 Answers

2

Using the Ghostscript API you can send input from anywhere you like. Depending on the output device you choose you may be able to send the output to stdout, or to retrieve a bitmap in memory.
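
As a rough illustration of the stdout route, here is a minimal C# sketch. It drives the Ghostscript command-line executable rather than the C API itself (where `gsapi_set_stdio` plays the equivalent role), so treat it as an approximation of the idea, not the API usage described above. The assumptions: the input is a single-page PostScript program (not a PDF), the runtime is .NET Core 2.1 or later (for `ArgumentList`), and the executable is called `gs` and is on the PATH (on Windows it is usually `gswin64c.exe`). `GsPipe` and `RenderPostScriptPage` are made-up names.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;

public static class GsPipe
{
    // Hypothetical helper: feeds PostScript to Ghostscript on stdin and
    // collects the PNG that the png16m device writes to stdout.
    public static byte[] RenderPostScriptPage(byte[] psData, string gsExe = "gs")
    {
        var psi = new ProcessStartInfo
        {
            FileName = gsExe,               // assumed executable name
            UseShellExecute = false,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            CreateNoWindow = true,
        };
        string[] args =
        {
            "-q", "-dNOPAUSE", "-dBATCH",
            "-sstdout=%stderr",             // keep PostScript's own %stdout out of the image data
            "-sDEVICE=png16m", "-r150",
            "-sOutputFile=-",               // device output goes to stdout
            "-",                            // read the program from stdin
        };
        foreach (string arg in args) psi.ArgumentList.Add(arg);

        using (Process gs = Process.Start(psi))
        using (var output = new MemoryStream())
        {
            // Write stdin on a separate task so a full stdout pipe cannot deadlock us.
            Task feed = Task.Run(() =>
            {
                using (Stream stdin = gs.StandardInput.BaseStream)
                {
                    stdin.Write(psData, 0, psData.Length);
                }   // disposing the stream signals end-of-input to Ghostscript
            });

            gs.StandardOutput.BaseStream.CopyTo(output);
            feed.Wait();
            gs.WaitForExit();

            if (gs.ExitCode != 0)
                throw new InvalidOperationException(
                    $"Ghostscript exited with code {gs.ExitCode}");
            return output.ToArray();
        }
    }
}
```

With more than one page the PNG images would simply be concatenated on stdout, so this only makes sense for a single page or for a device whose output can be streamed.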

If you want TIFF output then you have to have an output file (Tagged Image File Format, the clue is in the name...)

Similarly, you can't do this with PDF files as input; those have to be available as a file, because PDF is a random-access format.

What leads you to think that this is a performance problem?

KenS
  • Much of this is incorrect. TIFFs can be represented as a byte array just like any other file. And byte arrays certainly don't preclude random access--in fact, they make them much faster (if you have sufficient memory). And dealing with file IO instead of manipulating files in-memory is, in fact, much slower (again, if you have the memory). Your conclusion was (at the time) correct, but your rationale was not. – Daniel Aug 26 '22 at 17:40
  • I disagree, it is all correct **in the context of Ghostscript**. There is no Ghostscript device to hold TIFF in memory, if you want to write one, go ahead, but right now there isn't one. PDF files can't readily be dealt with in memory because they have to be dealt with in PostScript, especially if being fed to the interpreter via stdin, which has limits on the size of objects (usually 64KB). So I stand by my answer, when taken in context. The question is entirely about Ghostscript and the Ghostscript API, my answer is therefore restricted to Ghostscript. – KenS Aug 26 '22 at 18:16
  • If your intent was to restrict the context to ghostscript, it was not clear to me. Your statements about PDF & TIFF formats read as general statements, not statements about how GS works with those formats (e.g. the clue is NOT in the name, since the name is not specific to GS and the conclusion is not true outside of GS). The "real" answer is much less intuitive; that Ghostscript is not written to manipulate the file in-memory, and nobody is eager to re-write it for free. Even the statement in your comment about PDFs not being manipulable in memory reads to me as general--and, if so, is false. – Daniel Aug 26 '22 at 21:08
  • Well, I'm sorry you misunderstood my 9-year-old answer. The whole question is about Ghostscript (it isn't a very large question) so I thought it was reasonably obvious my answer was as well. As for the light-hearted comment, well, I guess humour is best avoided. – KenS Aug 27 '22 at 07:43
0

Since there still isn't a correct answer here all these years later, I'll provide one.

Ghostscript performs its operations on disk. It doesn't use the input and output paths merely to load the file into memory, perform operations, and write it back. It actually reads and writes parts of the file to disk as it goes (using multiple threads). While this IS slower, it also uses much less memory (bearing in mind that these files could potentially be quite large).

Because the operations are performed on disk, there was not (at the time of this question) any way to pass in or retrieve a byte array/memory stream because to do so would be "dishonest"--it might imply that it was a "shortcut" to prevent disk IO when in fact it would not. Later, support was added to accept & return memory streams, but it's important to note that this support merely accepted the memory stream, wrote it to a temporary file, performed the operations, and then read it back to a new memory stream.
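
To make that round trip concrete, here is a minimal sketch of the same pattern done by hand: write the bytes to a temporary file, let Ghostscript work on disk, then read the pages back into memory. It is not the library's actual implementation. It assumes the Ghostscript command-line executable is on the PATH as `gs` (on Windows it is usually `gswin64c.exe`), .NET Core 2.1 or later (for `ArgumentList`), and PNG output at 150 dpi; `PdfPages` and the file names are made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

public static class PdfPages
{
    // Hypothetical helper matching the signature from the question.
    // Returns one PNG-encoded byte array per page.
    public static byte[][] ConvertPDF(byte[] pdfData, string gsExe = "gs")
    {
        // Private scratch directory so concurrent calls cannot collide.
        string workDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(workDir);
        try
        {
            File.WriteAllBytes(Path.Combine(workDir, "input.pdf"), pdfData);

            var psi = new ProcessStartInfo
            {
                FileName = gsExe,            // assumed executable name
                WorkingDirectory = workDir,
                UseShellExecute = false,
                CreateNoWindow = true,
            };
            string[] args =
            {
                "-q", "-dNOPAUSE", "-dBATCH",
                "-sDEVICE=png16m", "-r150",
                "-sOutputFile=page-%06d.png", // one file per page, zero-padded
                "input.pdf",
            };
            foreach (string arg in args) psi.ArgumentList.Add(arg);

            using (Process gs = Process.Start(psi))
            {
                gs.WaitForExit();
                if (gs.ExitCode != 0)
                    throw new InvalidOperationException(
                        $"Ghostscript exited with code {gs.ExitCode}");
            }

            // Zero padding makes ordinal file-name order equal page order.
            return Directory.GetFiles(workDir, "page-*.png")
                            .OrderBy(path => path, StringComparer.Ordinal)
                            .Select(path => File.ReadAllBytes(path))
                            .ToArray();
        }
        finally
        {
            Directory.Delete(workDir, recursive: true);
        }
    }
}
```

Pointing the temporary directory at a RAM disk keeps the same code but takes the physical disk out of the picture.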

If that still meets your needs (for example, if you want the inevitable IO to be handled by the library rather than your business logic), here are a few links demonstrating how to go about it (your exact needs do change the mechanics).

Image to PDF (memory stream to memory stream via rasterizer)

Image to PDF (file to memory stream via processor)

PDF to image (memory stream to memory stream via rasterizer)

Hopefully these will, collectively, provide enough information to solve this issue for others who, like me and the OP, mostly found people saying it was impossible and that we shouldn't even be trying.

Daniel
  • For completeness, then: you can do what is wanted, but to do so you need to modify Ghostscript. It is not possible with Ghostscript as delivered, and the method post-dates the question. You can define a file system in Ghostscript and implement it yourself, then direct Ghostscript to read files from and write files to that file system. An example is %ram% (implemented by a contributor), here: https://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=db45b95faa06f8204b9a075323125d7f398c5d06. You would then read from %ram%/path/filename and direct Ghostscript to write to %ram%/path/outputfilename – KenS Aug 27 '22 at 18:47
  • Obviously you would want to define your own file system and give it a different name! I did note in the original answer that the output could be written to stdout, and that it was possible to retrieve a bitmap in memory. One of your examples uses the pdfwrite device and writes the output to a pipe. That's not a good plan, because the device expects to be writing to a seekable file: under some conditions it will reposition the file pointer and update the file, which obviously isn't going to work with a pipe. – KenS Aug 27 '22 at 18:54
  • The RAM FS is a very clever workaround. Over-engineered in the most delightful way, like Doom running in Excel. Are you able to provide any evidence for the statement that GS expects to be piping to a seekable file? I don't believe that's actually true with GS and I wasn't able to break it with cursory testing. – Daniel Aug 28 '22 at 19:07
  • Oh maaan, you woke up old memories.... That was a very important question for a very important project. We ended up using Ghostscript or a different tool, with in-memory disks to trick the tool and leverage the abundance of RAM. That worked really well; we processed more than a billion pages of PDF files that way in tens of hefty on pfdm servers! – Saw Aug 28 '22 at 19:35
  • I meant ramdisks – Saw Aug 28 '22 at 19:42
  • Evidence that the pdfwrite device requires a seekable file? Yes, an example would be a linearized output PDF file, which requires leaving an empty section in the file that is then overwritten with data later, after the remainder of the file has been output. Also, the way it works, it reads the original output file, writes a temporary file, then reads that back over the original file, IIRC: see ghostpdl/devices/vector/gdevpdf.c at around line 2029. Or just look for all the occurrences of gp_fseek in that file. – KenS Aug 29 '22 at 19:14
  • The ramfs code is intended as an example of adding a file system, not a solution in itself. You have to provide the input in a way that appears as a PostScript file in order for the PostScript interpreter to be able to read it, and since the PDF interpreter was (at the time) written in PostScript, and partially still is, making the file appear as a PostScript file is required. Hence the simplest solution actually is to have a custom file system. Which obviously vanilla Ghostscript does not provide. – KenS Aug 29 '22 at 19:16
  • Are you sure about the seekable file issue? Have you reproduced it? I'm asking because I believe the way GS works with `MemoryStream`s is that ALL operations are performed on disk and the resulting file is then written to the output file/pipe. I'd honestly be surprised if that wasn't the case. – Daniel Aug 29 '22 at 20:28