In my project I need to parse a PDF file that contains some characters rendered with Type 3 fonts. What I need to do is render such characters into a BufferedImage for further processing.

I'm not sure whether I'm on the right track, but I'm trying to get the PDType3CharProc for such characters:

PDType3Font font = (PDType3Font)textPosition.getFont();
PDType3CharProc charProc = font.getCharProc(textPosition.getCharacterCodes()[0]);

The input stream of this procedure contains the following data:

54 0 1 -1 50 43 d1
q
49 0 0 44 1.1 -1.1 cm
BI
/W 49
/H 44
/BPC 1
/IM true
ID
<some binary data here>
EI
Q

Unfortunately, I have no idea how to use this data to render the character into an image using PDFBox (or any other Java library).

Am I looking in the right direction, and what can I do with this data? If not, are there other tools that can solve this problem?
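
For the specific stream shown above, where the glyph is a single inline image with /IM true, /BPC 1 and no filter entry, the pixel data can in principle be unpacked with the JDK alone. The following is a minimal sketch, not a PDFBox API: the class and method names are hypothetical, it assumes the bytes between ID and EI have already been extracted into a byte array, and it ignores the cm matrix and any /Decode entry.

```java
import java.awt.image.BufferedImage;

final class StencilMaskDecoder {

    // Hypothetical helper: unpacks an unfiltered 1-bit image mask into an ARGB image.
    static BufferedImage decode(byte[] data, int width, int height) {
        // Each image row is padded to a whole number of bytes.
        int rowBytes = (width + 7) / 8;
        if (data.length < rowBytes * height) {
            throw new IllegalArgumentException(
                    "expected at least " + (rowBytes * height) + " bytes, got " + data.length);
        }
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int packed = data[y * rowBytes + x / 8] & 0xFF;
                // Samples are stored most significant bit first.
                int sample = (packed >> (7 - (x % 8))) & 1;
                // With the default decode array of an image mask, sample 0 is painted
                // and sample 1 leaves the background untouched (transparent here).
                image.setRGB(x, y, sample == 0 ? 0xFF000000 : 0x00000000);
            }
        }
        return image;
    }
}
```

For the 49 x 44 glyph above this expects 7 bytes per row, 308 bytes in total. Glyphs that are drawn with path operators or compressed images cannot be handled this way and need an actual renderer.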

  • Does my answer help? If not, please share a sample PDF representative for your Type 3 fonts. As already mentioned in my answer and then discussed with @Tilman in comments to it, there might be Type 3 font variations to handle differently... – mkl Feb 13 '17 at 10:09

2 Answers


Unfortunately PDFBox out-of-the-box does not provide a class to render contents of arbitrary XObjects (like the type 3 font char procs), at least as far as I can see.

But it does provide a class for rendering complete PDF pages; thus, to render a given type 3 font glyph, one can simply create a page containing only that glyph and render this temporary page!

Assuming, for example, that the type 3 font is defined on the first page of a PDDocument document and has the name F1, all its char procs can be rendered like this:

PDPage page = document.getPage(0);
PDResources pageResources = page.getResources();
COSName f1Name = COSName.getPDFName("F1");
PDType3Font fontF1 = (PDType3Font) pageResources.getFont(f1Name);
Map<String, Integer> f1NameToCode = fontF1.getEncoding().getNameToCodeMap();

COSDictionary charProcsDictionary = fontF1.getCharProcs();
for (COSName key : charProcsDictionary.keySet())
{
    COSStream stream = (COSStream) charProcsDictionary.getDictionaryObject(key);
    PDType3CharProc charProc = new PDType3CharProc(fontF1, stream);
    PDRectangle bbox = charProc.getGlyphBBox();
    if (bbox == null)
        bbox = charProc.getBBox();
    Integer code = f1NameToCode.get(key.getName());

    if (code != null)
    {
        PDDocument charDocument = new PDDocument();
        PDPage charPage = new PDPage(bbox);
        charDocument.addPage(charPage);
        charPage.setResources(pageResources);
        PDPageContentStream charContentStream = new PDPageContentStream(charDocument, charPage);
        charContentStream.beginText();
        charContentStream.setFont(fontF1, bbox.getHeight());
        // %02X (not %2X): codes below 16 need a leading zero, not a space, inside the hex string
        charContentStream.getOutput().write(String.format("<%02X> Tj\n", code).getBytes());
        charContentStream.endText();
        charContentStream.close();

        File result = new File(RESULT_FOLDER, String.format("4700198773-%s-%s.png", key.getName(), code));
        PDFRenderer renderer = new PDFRenderer(charDocument);
        BufferedImage image = renderer.renderImageWithDPI(0, 96);
        ImageIO.write(image, "PNG", result);
        charDocument.close();
    }
}

(RenderType3Character.java test method testRender4700198773)
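
The operand written before Tj is a PDF hex string. Whitespace inside a hex string is ignored and an odd number of digits is completed with a trailing 0, so each single-byte code has to be written as exactly two hex digits. A small sketch of that formatting step, using only the JDK (the class and method names are hypothetical and not part of PDFBox):

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class HexShowText {

    // Hypothetical helper: builds the "<XX> Tj" operator bytes for one character code.
    static byte[] showGlyphOperator(int code) {
        // Type 3 fonts are simple fonts, so character codes are single bytes.
        if (code < 0 || code > 0xFF) {
            throw new IllegalArgumentException("code out of single-byte range: " + code);
        }
        // %02X pads with a leading zero, so code 10 becomes <0A> rather than "< A>".
        String operator = String.format(Locale.ROOT, "<%02X> Tj\n", code);
        return operator.getBytes(StandardCharsets.US_ASCII);
    }
}
```

The resulting bytes would take the place of the inline String.format call in the loop above.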


Considering the textPosition variable in the OP's code, this is quite likely attempted from a text extraction use case. The OP will therefore have to either pre-generate the bitmaps as above and simply look them up by name, or adapt the code to the information available in that use case. For example, if only the font object is at hand and not the original page, the resources of the original page cannot be copied; instead one may create a new resources object and add the font object to it.


Unfortunately the OP did not provide a sample PDF, so for my test I used one from another Stack Overflow question: 4700198773.pdf from "extract text with custom font result non readble". The OP's own files may therefore still show issues.

mkl
  • reminds me that I had some code for PDFDebugger that was uncommitted. It uses the same strategy (create a PDF). Yours has one small flaw: it won't work when the BBox is very small. See font T4 in the file from PDFBOX-2959 (which is a very weird file, I have to admit) – Tilman Hausherr Feb 06 '17 at 17:26
  • *"[PDFBOX-2959](https://issues.apache.org/jira/browse/PDFBOX-2959)"* - funny animal, but valid at first glance. Ok, for such cases one should parse the content stream and determine the actual bounding box if the method used above results in a degenerate bounding box. That can actually be combined with the parsing in `charProc.getGlyphBBox()` and so does not create too much additional overhead. The implementation is left as an exercise for the reader... ;) – mkl Feb 06 '17 at 17:48
  • @TilmanHausherr The Type3 fonts in [sdnlist.pdf](https://www.treasury.gov/ofac/downloads/sdnlist.pdf) referenced in [this question](http://stackoverflow.com/q/42073700/1729265) also fail the method above. Oh well, the OP should tell whether things work alright for his files... – mkl Feb 07 '17 at 10:15
  • they show up in PDFDebugger font display (change from yesterday) although the glyphs are upside down. – Tilman Hausherr Feb 07 '17 at 10:28
  • @TilmanHausherr The font indeed works with mirroring upside-down twice. Probably you did not apply the Matrix? – mkl Feb 07 '17 at 10:50
  • I applied the reverse of it to avoid them showing up tiny. This worked fine... until now. Sigh. – Tilman Hausherr Feb 07 '17 at 11:01
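
As a rough illustration of the fallback discussed in these comments, here is a sketch of one narrow case only: a char proc that paints a single image under a single cm operator, as in the question's stream. It uses only the JDK; the class, the method names, and the minSize threshold are assumptions made for illustration, not PDFBox API. An image always occupies the unit square in its own space, so its extent in glyph space is the unit square transformed by the cm matrix.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

final class CharProcBounds {

    // Hypothetical check: treats a missing box or one thinner than minSize as unusable.
    static boolean isDegenerate(Rectangle2D box, double minSize) {
        return box == null || box.getWidth() < minSize || box.getHeight() < minSize;
    }

    // Bounds of an image painted under the matrix "a b c d e f cm".
    static Rectangle2D imageBounds(double a, double b, double c, double d, double e, double f) {
        // AffineTransform takes its six values in the same order as the PDF operands.
        AffineTransform cm = new AffineTransform(a, b, c, d, e, f);
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        return cm.createTransformedShape(unitSquare).getBounds2D();
    }
}
```

For the question's `49 0 0 44 1.1 -1.1 cm` this yields a box from roughly (1.1, -1.1) to (50.1, 42.9), close to the `1 -1 50 43` declared by d1. Char procs that draw paths, nest q/Q blocks, or concatenate several matrices need real content stream parsing.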

I stumbled upon the same issue and was able to render the Type 3 font by extending PDFRenderer and the underlying PageDrawer:

class Type3PDFRenderer extends PDFRenderer
{

    private PDFont font;

    public Type3PDFRenderer(PDDocument document, PDFont font)
    {
        super(document);
        this.font = font;
    }

    @Override
    protected PageDrawer createPageDrawer(PageDrawerParameters parameters) throws IOException
    {
        FontType3PageDrawer pd = new FontType3PageDrawer(parameters, this.font);
        pd.setAnnotationFilter(super.getAnnotationsFilter());//as done in the super class
        return pd;
    }       
}

class FontType3PageDrawer extends PageDrawer
{

    private PDFont font;

    public FontType3PageDrawer(PageDrawerParameters parameters, PDFont font) throws IOException
    {
        super(parameters);
        this.font = font;
    }

    @Override
    public PDGraphicsState getGraphicsState()
    {
        PDGraphicsState gs = super.getGraphicsState();
        gs.getTextState().setFont(this.font);
        return gs;
    }       
}

Simply use Type3PDFRenderer instead of PDFRenderer. Of course, if you have multiple fonts, this needs some more modification to handle them.

Edit: tested with pdfbox 2.0.9

STM