I want to get all objects except text object as an image from PDF using iTextSharp

Question

I am developing a program to convert PDF to PPTX for specific reasons using iTextSharp. So far I can get all text objects and image objects along with their locations. But I'm having difficulty getting table objects without their text. Actually, it would be better if I could get them as images. My plan is to merge all objects except text objects into a background image and put the text objects at the proper locations. I tried to find similar questions here, but no luck so far. If anyone knows how to do this particular job, please answer. Thanks.

There is nothing like a *table object* in a pdf (unless it's properly tagged, and even then it's merely a logical table object, not a graphical one), there only are chunks of text (or whatever table content you see) and probably some graphical objects like lines or colored rectangles. Thus, it is unclear what you want. — mkl, Jan 02 '19 at 11:12
mkl, thanks for your reply. Hope I can get help from you again on this question. I agree that there should be no table objects but it's interesting that when I get all images I can't see ones for tables. I used IRenderListener. Looking forward to your answer. — JM217, Jan 02 '19 at 12:52
Implement `IExtRenderListener` which extends `IRenderListener` but has additional callbacks for vector graphics related instructions. Most likely these additional callbacks will be invoked for the *lines or colored rectangles* structuring your table. — mkl, Jan 02 '19 at 17:45
Thanks a lot, mkl. I tried IExtRenderListener but have no idea how to use Path. Basically what I need to do is draw all objects on PPTX. I'm afraid Path includes all texts and images too. On the other hand, I'm thinking of removing all text objects from the PDF to get a temporary PDF. Then I can get the whole page (text objects removed) as an image and use it as a background. Do you have any ideas on how to implement this, i.e. removing the text objects and making a new PDF without text? Thanks in advance. — JM217, Jan 03 '19 at 03:39

score 2 · Accepted Answer · answered Jan 07 '19 at 12:04

You say

"What I've done so far is to get all text objects and image objects and locations."

but you don't go into detail about how you do so. I assume you use a matching IRenderListener implementation.

But IRenderListener, as you found out yourself, "only extracts images and texts."

The main missing objects are paths and their usages.

To extract them, too, you should implement IExtRenderListener, which extends IRenderListener but also receives information about paths. To understand the callback methods, please first be aware of how path-related instructions work in PDFs:

First there are instructions for building the actual path; these instructions essentially
- move to some position,
- add a line to some position from the previous position,
- add a Bézier curve to some position from the previous position using some control points, or
- add an upright rectangle at some position using some width and height information.
Then there is an optional instruction to intersect the current clip path with the generated path.
Finally, there is a drawing instruction for any combination of filling the inside of the path and stroking along the path, i.e. for doing both, either one, or neither one.

This corresponds to the callbacks you receive in your IExtRenderListener implementation:

/**
 * Called when the current path is being modified. E.g. new segment is being added,
 * new subpath is being started etc.
 *
 * @param renderInfo Contains information about the path segment being added to the current path.
 */
void ModifyPath(PathConstructionRenderInfo renderInfo);

is called one or more times to build the actual path. PathConstructionRenderInfo contains the instruction type in its Operation property (compare it to the PathConstructionRenderInfo constants MOVETO, LINETO, etc. to determine the operation type) and the required coordinates / dimensions in its SegmentData property. The Ctm property additionally returns the affine transformation that is currently set to be applied to all drawing operations.

Then

/**
 * Called when the current path should be set as a new clipping path.
 *
 * @param rule Either {@link PathPaintingRenderInfo#EVEN_ODD_RULE} or {@link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
 */
void ClipPath(int rule);

is called if the current clip path shall be intersected with the constructed path.

Finally

/**
 * Called when the current path should be rendered.
 *
 * @param renderInfo Contains information about the current path which should be rendered.
 * @return The path which can be used as a new clipping path.
 */
Path RenderPath(PathPaintingRenderInfo renderInfo);

is called, PathPaintingRenderInfo containing the drawing operation in its Operation property (any combination of the PathPaintingRenderInfo constants STROKE and FILL), the rule for determining what "inside the path" means in its Rule property (NONZERO_WINDING_RULE or EVEN_ODD_RULE), and some other drawing details in the Ctm, LineWidth, LineCapStyle, LineJoinStyle, MiterLimit, and LineDashPattern properties.
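
As a rough illustration of how these three callbacks fit together, here is a minimal sketch of a listener that merely logs each path it is told about. The class name PathLoggingListener and the log format are made up for this example, and it assumes an iTextSharp 5.5.x version that ships IExtRenderListener. The coordinates in SegmentData are logged as-is; to get page coordinates you would still have to apply the Ctm.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical example class: collects the segments of the current path
// and prints them when the path is painted. Text and images are ignored.
internal class PathLoggingListener : IExtRenderListener
{
    // Segments of the path currently under construction.
    private readonly List<string> _currentPath = new List<string>();

    public void ModifyPath(PathConstructionRenderInfo renderInfo)
    {
        int op = renderInfo.Operation;
        string name;
        if (op == PathConstructionRenderInfo.MOVETO) name = "moveTo";
        else if (op == PathConstructionRenderInfo.LINETO) name = "lineTo";
        else if (op == PathConstructionRenderInfo.RECT) name = "rect";
        else if (op == PathConstructionRenderInfo.CLOSE) name = "close";
        else name = "curve"; // one of the Bézier curve variants

        // SegmentData may be missing for operations without operands (e.g. close).
        var data = renderInfo.SegmentData;
        string operands = data == null ? "" : string.Join(", ", data);
        _currentPath.Add(name + "(" + operands + ")");
    }

    public void ClipPath(int rule)
    {
        string ruleName = rule == PathPaintingRenderInfo.EVEN_ODD_RULE ? "even-odd" : "nonzero winding";
        Console.WriteLine("clip with current path, rule: " + ruleName);
    }

    public Path RenderPath(PathPaintingRenderInfo renderInfo)
    {
        // Operation is a bit combination of STROKE and FILL (or neither).
        bool stroke = (renderInfo.Operation & PathPaintingRenderInfo.STROKE) != 0;
        bool fill = (renderInfo.Operation & PathPaintingRenderInfo.FILL) != 0;

        Console.WriteLine("path [{0}] stroke={1} fill={2} lineWidth={3}",
            string.Join(" ", _currentPath), stroke, fill, renderInfo.LineWidth);

        // The painting instruction ends the current path.
        _currentPath.Clear();

        // This sketch does not build a Path object, so it returns null.
        return null;
    }

    // IRenderListener members, unused in this sketch.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}

internal static class Program
{
    private static void Main(string[] args)
    {
        // args[0]: path to an unencrypted PDF file (assumption of this sketch).
        using (var reader = new PdfReader(args[0]))
        {
            var parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                parser.ProcessContent(page, new PathLoggingListener());
            }
        }
    }
}
```

For a typical table you would expect the output to consist mostly of lineTo and rect segments that are stroked or filled; those are the candidates to redraw as shapes on the PPTX side.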

Thanks a lot, @mkl! I think this will be an answer to my other question. Please check it and share this link as an answer. I would really appreciate any comments on my thoughts in that question. Thanks again. Link: https://stackoverflow.com/questions/54059341/can-i-remove-text-objects-from-an-existing-pdf-and-output-to-a-new-pdf-using-ite — JM217, Jan 07 '19 at 12:20

score 0 · Answer 2 · answered Jan 02 '19 at 10:08

Try implementing IRenderListener:

using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

internal class ImageExtractor : IRenderListener
{
    private int _currentPage = 1;
    private int _imageCount = 0;
    private readonly string _outputFilePrefix;
    private readonly string _outputFolder;
    private readonly bool _overwriteExistingFiles;

    private ImageExtractor(string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        _outputFilePrefix = outputFilePrefix;
        _outputFolder = outputFolder;
        _overwriteExistingFiles = overwriteExistingFiles;
    }

    /// <summary>
    /// Extract all images from a PDF file
    /// </summary>
    /// <param name="pdfPath">Full path and file name of PDF file</param>
    /// <param name="outputFilePrefix">Basic name of exported files. If null then uses same name as PDF file.</param>
    /// <param name="outputFolder">Where to save images. If null or empty then uses same folder as PDF file.</param>
    /// <param name="overwriteExistingFiles">True to overwrite existing image files, false to skip past them</param>
    /// <returns>Count of number of images extracted.</returns>
    public static int ExtractImagesFromFile(string pdfPath, string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        // Handle setting of any default values
        outputFilePrefix = outputFilePrefix ?? System.IO.Path.GetFileNameWithoutExtension(pdfPath);
        outputFolder = String.IsNullOrEmpty(outputFolder) ? System.IO.Path.GetDirectoryName(pdfPath) : outputFolder;

        var instance = new ImageExtractor(outputFilePrefix, outputFolder, overwriteExistingFiles);

        using (var pdfReader = new PdfReader(pdfPath))
        {
            if (pdfReader.IsEncrypted())
                throw new ApplicationException(pdfPath + " is encrypted.");

            var pdfParser = new PdfReaderContentParser(pdfReader);

            while (instance._currentPage <= pdfReader.NumberOfPages)
            {
                pdfParser.ProcessContent(instance._currentPage, instance);

                instance._currentPage++;
            }
        }

        return instance._imageCount;
    }

    #region Implementation of IRenderListener

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        var imageObject = renderInfo.GetImage();

        // Number the files so that several images from one PDF do not collide;
        // GetFileType() supplies the extension (e.g. "jpg" or "png").
        var imageFileName = _outputFilePrefix + _imageCount + "." + imageObject.GetFileType();
        var imagePath = System.IO.Path.Combine(_outputFolder, imageFileName);

        if (_overwriteExistingFiles || !File.Exists(imagePath))
        {
            // Write the image bytes to a new file
            File.WriteAllBytes(imagePath, imageObject.GetImageAsBytes());
        }

        _imageCount++;
    }

    #endregion // Implementation of IRenderListener

}

Yes, I already tried IRenderListener. This method only extracts images and texts. It does not return anything about tables. There's no table-related function. — JM217, Jan 02 '19 at 10:26
