I am trying to extract text coordinates and line (or rectangle) coordinates from a PDF.
The TextPosition
class has getXDirAdj()
and getYDirAdj()
methods which transform coordinates according to the direction of the text piece the respective TextPosition object represents (Corrected based on comment from @mkl)
The final output is consistent, irrespective of the page rotation.
The coordinates needed on the output are X0,Y0 (TOP LEFT CORNER OF THE PAGE)
This is a slight modification from the solution by @Tilman Hausherr. The y coordinates are inverted (height - y) to keep it consistent with the coordinates from the text extraction process, also the output is written to a csv.
public class LineCatcher extends PDFGraphicsStreamEngine
{
private static final GeneralPath linePath = new GeneralPath();
private static ArrayList<Rectangle2D> rectList= new ArrayList<Rectangle2D>();
private int clipWindingRule = -1;
private static String headerRecord = "Text|Page|x|y|width|height|space|font";
public LineCatcher(PDPage page)
{
super(page);
}
public static void main(String[] args) throws IOException
{
if( args.length != 4 )
{
usage();
}
else
{
PDDocument document = null;
FileOutputStream fop = null;
File file;
Writer osw = null;
int numPages;
double page_height;
try
{
document = PDDocument.load( new File(args[0], args[1]) );
numPages = document.getNumberOfPages();
file = new File(args[2], args[3]);
fop = new FileOutputStream(file);
// if file doesnt exists, then create it
if (!file.exists()) {
file.createNewFile();
}
osw = new OutputStreamWriter(fop, "UTF8");
osw.write(headerRecord + System.lineSeparator());
System.out.println("Line Processing numPages:" + numPages);
for (int n = 0; n < numPages; n++) {
System.out.println("Line Processing page:" + n);
rectList = new ArrayList<Rectangle2D>();
PDPage page = document.getPage(n);
page_height = page.getCropBox().getUpperRightY();
LineCatcher lineCatcher = new LineCatcher(page);
lineCatcher.processPage(page);
try{
for(Rectangle2D rect:rectList) {
String pageNum = Integer.toString(n + 1);
String x = Double.toString(rect.getX());
String y = Double.toString(page_height - rect.getY()) ;
String w = Double.toString(rect.getWidth());
String h = Double.toString(rect.getHeight());
writeToFile(pageNum, x, y, w, h, osw);
}
rectList = null;
page = null;
lineCatcher = null;
}
catch(IOException io){
throw new IOException("Failed to Parse document for line processing. Incorrect document format. Page:" + n);
}
};
}
catch(IOException io){
throw new IOException("Failed to Parse document for line processing. Incorrect document format.");
}
finally
{
if ( osw != null ){
osw.close();
}
if( document != null )
{
document.close();
}
}
}
}
private static void writeToFile(String pageNum, String x, String y, String w, String h, Writer osw) throws IOException {
String c = "^" + "|" +
pageNum + "|" +
x + "|" +
y + "|" +
w + "|" +
h + "|" +
"999" + "|" +
"marker-only";
osw.write(c + System.lineSeparator());
}
@Override
public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
{
// to ensure that the path is created in the right direction, we have to create
// it by combining single lines instead of creating a simple rectangle
linePath.moveTo((float) p0.getX(), (float) p0.getY());
linePath.lineTo((float) p1.getX(), (float) p1.getY());
linePath.lineTo((float) p2.getX(), (float) p2.getY());
linePath.lineTo((float) p3.getX(), (float) p3.getY());
// close the subpath instead of adding the last line so that a possible set line
// cap style isn't taken into account at the "beginning" of the rectangle
linePath.closePath();
}
@Override
public void drawImage(PDImage pdi) throws IOException
{
}
@Override
public void clip(int windingRule) throws IOException
{
// the clipping path will not be updated until the succeeding painting operator is called
clipWindingRule = windingRule;
}
@Override
public void moveTo(float x, float y) throws IOException
{
linePath.moveTo(x, y);
}
@Override
public void lineTo(float x, float y) throws IOException
{
linePath.lineTo(x, y);
}
@Override
public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
{
linePath.curveTo(x1, y1, x2, y2, x3, y3);
}
@Override
public Point2D getCurrentPoint() throws IOException
{
return linePath.getCurrentPoint();
}
@Override
public void closePath() throws IOException
{
linePath.closePath();
}
@Override
public void endPath() throws IOException
{
if (clipWindingRule != -1)
{
linePath.setWindingRule(clipWindingRule);
getGraphicsState().intersectClippingPath(linePath);
clipWindingRule = -1;
}
linePath.reset();
}
@Override
public void strokePath() throws IOException
{
rectList.add(linePath.getBounds2D());
linePath.reset();
}
@Override
public void fillPath(int windingRule) throws IOException
{
linePath.reset();
}
@Override
public void fillAndStrokePath(int windingRule) throws IOException
{
linePath.reset();
}
@Override
public void shadingFill(COSName cosn) throws IOException
{
}
/**
* This will print the usage for this document.
*/
private static void usage()
{
System.err.println( "Usage: java " + LineCatcher.class.getName() + " <input-pdf>" + " <output-file>");
}
}
Was using the PDFGraphicsStreamEngine
class to extract Line and Rectangle coordinates. The coordinates of lines and rectangles do not align with the coordinates of the text
Green: Text Red: Line coordinates obtained as is Black: Expected coordinates (Obtained after applying transformation on the output)
Tried the setRotation()
method to correct for the rotation before running the line extract. However the results are not consistent.
What are the possible options to get the rotation and get a consistent output of the Line / Rectangle coordinates using PDFBox?