We have been using the iText based PdfVeryDenseMergeTool we found in this SO question How To Remove Whitespace on Merge to merge multiple PDF files into a single PDF file. The tool merges PDFs without leaving any whitespace in between, and individual PDFs also get broken out across pages when possible.
We want to port PdfVeryDenseMergeTool to PDFBox. We found a PDFBox 2 based PdfDenseMergeTool that merges PDFs like this:
Individual PDFs:
Dense Merged PDF:
We are looking for something like this (this is already done in the iText-based PdfVeryDenseMergeTool, but we want to do it using PDFBox 2):
In our attempt to do the porting, we found that PdfVeryDenseMergeTool uses a PageVerticalAnalyzer that extends the iText PDF Render Listener and gets called every time text, an image, or an arc is drawn in a PDF. The collected rendering info is then used to split an individual PDF across multiple pages. We looked for a similar PDF Render Listener in PDFBox 2 but found that the available PDFRenderer class only has image rendering methods. So we are not sure how to port PageVerticalAnalyzer to PDFBox.
If someone can suggest an approach to move forward, we'd greatly appreciate their help.
Thanks a lot!
EDIT 7 Feb 2020
At present, we are extending PDFGraphicsStreamEngine from PDFBox to make a custom rendering engine that tracks coordinates of images, text lines, and arcs when they are drawn. That custom engine will be the port of the PageVerticalAnalyzer. After that, we are hoping to be able to port PdfVeryDenseMergeTool to PDFBox.
EDIT 8 Feb 2020
Here is a very simple port of PageVerticalAnalyzer that handles images and text. I'm a PDFBox newbie, so my logic to handle images is probably wonky. Here's the basic approach:
Text: for every glyph printed, get the bottomY and make topY = bottomY + charHeight, mark those top/bottom points.
Image: for every call to drawImage(), there seem to be two ways to figure out where the image was drawn. The first uses the coords from the last call to appendRectangle(); the second uses the last calls to moveTo(), multiple lineTo(), and closePath(). I give the latter priority: in one PDF I found such a path before drawImage(), but in another I only found appendRectangle(), so I fall back to the former when there is no path. If neither exists, I have no clue what to do. Here's how I'm assuming PDFBox marks image coords using moveTo()/lineTo()/closePath():
Here is my current implementation:
import java.awt.geom.Point2D;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;
public class PageVerticalAnalyzer extends PDFGraphicsStreamEngine
{
/**
* This is a port of iText based PageVerticalAnalyzer found here
* https://github.com/mkl-public/testarea-itext5/blob/master/src/main/java/mkl/testarea/itext5/merge/PageVerticalAnalyzer.java
*
* @param page PDF Page
*/
protected PageVerticalAnalyzer(PDPage page)
{
super(page);
}
public static void main(String[] args) throws IOException
{
File file = new File("q2.pdf");
try (PDDocument doc = PDDocument.load(file))
{
PDPage page = doc.getPage(0);
PageVerticalAnalyzer engine = new PageVerticalAnalyzer(page);
engine.run();
System.out.println(engine.verticalFlips);
}
}
/**
* Runs the engine on the current page.
*
* @throws IOException If there is an IO error while drawing the page.
*/
public void run() throws IOException
{
processPage(getPage());
for (PDAnnotation annotation : getPage().getAnnotations())
{
showAnnotation(annotation);
}
}
// All path related stuff
@Override
public void clip(int windingRule) throws IOException
{
System.out.println("clip");
}
@Override
public void moveTo(float x, float y) throws IOException
{
System.out.printf("moveTo %.2f %.2f%n", x, y);
// unboxing (Float) null throws a NullPointerException, so NaN marks "bottom not known yet"
lastPathBottomTop = new float[] {Float.NaN, y};
}
@Override
public void lineTo(float x, float y) throws IOException
{
System.out.printf("lineTo %.2f %.2f%n", x, y);
lastLineTo = new float[] {x, y};
}
@Override
public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
{
System.out.printf("curveTo %.2f %.2f, %.2f %.2f, %.2f %.2f%n", x1, y1, x2, y2, x3, y3);
}
@Override
public Point2D getCurrentPoint() throws IOException
{
// if you want to build paths, you'll need to keep track of this like PageDrawer does
return new Point2D.Float(0, 0);
}
@Override
public void closePath() throws IOException
{
System.out.println("closePath");
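// a path may be closed without a preceding moveTo()/lineTo(), so guard against null
if (lastPathBottomTop != null && lastLineTo != null)
{
lastPathBottomTop[0] = lastLineTo[1];
}
lastLineTo = null;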
}
@Override
public void endPath() throws IOException
{
System.out.println("endPath");
}
@Override
public void strokePath() throws IOException
{
System.out.println("strokePath");
}
@Override
public void fillPath(int windingRule) throws IOException
{
System.out.println("fillPath");
}
@Override
public void fillAndStrokePath(int windingRule) throws IOException
{
System.out.println("fillAndStrokePath");
}
@Override
public void shadingFill(COSName shadingName) throws IOException
{
System.out.println("shadingFill " + shadingName.toString());
}
// Rectangle related stuff
@Override
public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
{
System.out.printf("appendRectangle %.2f %.2f, %.2f %.2f, %.2f %.2f, %.2f %.2f%n",
p0.getX(), p0.getY(), p1.getX(), p1.getY(),
p2.getX(), p2.getY(), p3.getX(), p3.getY());
lastRectBottomTop = new float[] {(float) p0.getY(), (float) p3.getY()};
}
// Image drawing
@Override
public void drawImage(PDImage pdImage) throws IOException
{
System.out.println("drawImage");
if (lastPathBottomTop != null) {
addVerticalUseSection(lastPathBottomTop[0], lastPathBottomTop[1]);
} else if (lastRectBottomTop != null ){
addVerticalUseSection(lastRectBottomTop[0], lastRectBottomTop[1]);
} else {
throw new IllegalStateException("Drawing image without last reference!");
}
lastPathBottomTop = null;
lastRectBottomTop = null;
}
// All text related stuff
@Override
public void showTextString(byte[] string) throws IOException
{
System.out.print("showTextString \"");
super.showTextString(string);
System.out.println("\"");
}
@Override
public void showTextStrings(COSArray array) throws IOException
{
System.out.print("showTextStrings \"");
super.showTextStrings(array);
System.out.println("\"");
}
@Override
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode,
Vector displacement) throws IOException
{
// print the actual character that is being rendered
System.out.print(unicode);
super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
// the text rendering matrix carries the scaling of the glyph
// and the x/y point of the glyph origin
//System.out.println(textRenderingMatrix.toString());
// y of the glyph origin, i.e. the baseline (descenders reach below it)
// getTranslateY() reads the same entry as getValue(0, 7): getValue() indexes
// the flat 3x3 array as row * 3 + column, so this is row 2, column 1
float yBottom = textRenderingMatrix.getTranslateY();
// approximate height of the char
// using the vertical scaling factor of the matrix as the char height
float yTop = yBottom + textRenderingMatrix.getScalingFactorY();
addVerticalUseSection(yBottom, yTop);
}
// Keeping track of bottom/top point pairs
void addVerticalUseSection(float from, float to)
{
if (to < from)
{
float temp = to;
to = from;
from = temp;
}
int i=0, j=0;
for (; i<verticalFlips.size(); i++)
{
float flip = verticalFlips.get(i);
if (flip < from)
continue;
for (j=i; j<verticalFlips.size(); j++)
{
flip = verticalFlips.get(j);
if (flip < to)
continue;
break;
}
break;
}
boolean fromOutsideInterval = i%2==0;
boolean toOutsideInterval = j%2==0;
while (j-- > i)
verticalFlips.remove(j);
if (toOutsideInterval)
verticalFlips.add(i, to);
if (fromOutsideInterval)
verticalFlips.add(i, from);
}
final List<Float> verticalFlips = new ArrayList<Float>();
private float[] lastRectBottomTop;
private float[] lastPathBottomTop;
private float[] lastLineTo;
}
I am looking for answers to the following questions:
- How can I improve this implementation?
- How can I handle other things, like curves, that I have not handled yet?
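One possible direction for both questions, given only as a sketch and not as a tested PDFBox port: instead of guessing the image position from a preceding rectangle or path, take the vertical extent from the current transformation matrix, because a PDF image is painted into the unit square mapped by that matrix. For lines and curves, collect the segments in a java.awt.geom.Path2D and read its bounding box when the path is stroked or filled. The example below uses only java.awt.geom so it runs without PDFBox. The class name VerticalExtents and both helper names are hypothetical. In the engine above, the AffineTransform would presumably come from getGraphicsState().getCurrentTransformationMatrix().createAffineTransform(), and the Path2D would be a field that moveTo(), lineTo(), curveTo(), closePath(), and appendRectangle() append to.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox or of the original PageVerticalAnalyzer.
class VerticalExtents {

    /**
     * Returns {bottomY, topY} of an image drawn with the given transformation.
     * The four corners of the unit square are transformed and the smallest
     * and largest y are kept, so rotated or flipped images are covered too.
     */
    static double[] imageExtent(AffineTransform ctm) {
        double[] corners = {0, 0, 1, 0, 0, 1, 1, 1};
        double[] mapped = new double[8];
        ctm.transform(corners, 0, mapped, 0, 4);
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 1; i < mapped.length; i += 2) { // odd indices hold y values
            min = Math.min(min, mapped[i]);
            max = Math.max(max, mapped[i]);
        }
        return new double[] {min, max};
    }

    /**
     * Returns {bottomY, topY} of the bounding box of a path.
     * Depending on the JDK version, the box may include Bezier control
     * points, in which case it overestimates the area a curve really covers.
     */
    static double[] pathExtent(Path2D path) {
        Rectangle2D bounds = path.getBounds2D();
        return new double[] {bounds.getMinY(), bounds.getMaxY()};
    }

    public static void main(String[] args) {
        // PDF "a b c d e f cm" corresponds to new AffineTransform(a, b, c, d, e, f):
        // here an image scaled to 200 x 100 with its lower left corner at (50, 300)
        AffineTransform ctm = new AffineTransform(200.0, 0.0, 0.0, 100.0, 50.0, 300.0);
        double[] image = imageExtent(ctm);
        System.out.println("image: " + image[0] + " .. " + image[1]);

        // an arc-like path built from the same calls the engine receives
        Path2D.Double path = new Path2D.Double();
        path.moveTo(10, 20);
        path.curveTo(10, 60, 50, 60, 50, 20);
        path.closePath();
        double[] curve = pathExtent(path);
        System.out.println("curve: " + curve[0] + " .. " + curve[1]);
    }
}
```

In the analyzer, each returned pair would go to addVerticalUseSection(). The path would be reset after every painting operator, and paths that are ended with endPath() or only used for clip() would be discarded without marking anything, since they do not draw. This assumes the coordinates passed to moveTo(), lineTo(), and curveTo() are already in the same coordinate system as the text positions, which is worth verifying against a test PDF.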