My task is to extract text from a PDF at specific coordinates.
I am using Apache PDFBox for the data extraction.
To get the x, y, width and height of the region I am using the PDF-XChange tool, which reports them in millimeters. When I pass those values to the rectangle, the extracted text comes back empty.
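One thing that may matter here is the unit. PDF user space is measured in points (1/72 inch), and as far as I understand PDFTextStripperByArea expects the region in points measured from the top-left corner of the page, so millimeter values would need converting first. Below is a minimal sketch of that conversion, assuming the PDF-XChange values are also measured from the top-left corner. The names `MmRegion`, `mmToPt` and `fromMillimeters` are made up for illustration and are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: converts a region given in
// millimeters into PDF points (1 pt = 1/72 inch, 1 inch = 25.4 mm).
final class MmRegion {
    private static final double POINTS_PER_MM = 72.0 / 25.4;

    private MmRegion() {
    }

    // Convert a single length from millimeters to points.
    static double mmToPt(double mm) {
        return mm * POINTS_PER_MM;
    }

    // Build a rectangle in points from x, y, width and height in millimeters.
    // Assumption: x and y are measured from the top-left corner of the page.
    static Rectangle2D fromMillimeters(double xMm, double yMm, double widthMm, double heightMm) {
        return new Rectangle2D.Double(mmToPt(xMm), mmToPt(yMm), mmToPt(widthMm), mmToPt(heightMm));
    }
}
```

The resulting rectangle could then be passed to `addRegion` in place of the raw millimeter values. If the tool measures y from the bottom of the page instead, the y value would also need to be flipped against the page height (page height minus y minus region height, all in points).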
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public String getTextUsingPositionsUsingPdf(String pdfLocation, int pageNumber, double x, double y, double width,
        double height) throws IOException {
    String extractedText = "";
    if (!pdfLocation.endsWith(".pdf")) {
        System.out.println("No data returned");
        return extractedText;
    }
    // PDDocument.load reads an existing PDF. try-with-resources closes the
    // document even if extraction fails and never calls close() on null.
    try (PDDocument document = PDDocument.load(new File(pdfLocation))) {
        int pageCount = document.getNumberOfPages();
        System.out.println(pageCount);
        // getPage takes a zero-based index, so the first page is index 0.
        // pageNumber is assumed to be one-based here.
        int pageIndex = pageNumber - 1;
        if (pageIndex < 0 || pageIndex >= pageCount) {
            System.out.println("Page " + pageNumber + " does not exist");
            return extractedText;
        }
        PDPage page = document.getPage(pageIndex);
        // Create the rectangle from the x, y, width and height of the region.
        Rectangle2D rect = new Rectangle2D.Double(x, y, width, height);
        String regionName = "region1";
        // PDFTextStripperByArea extracts the text inside each named region,
        // so the rectangle has to be registered under a name.
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        stripper.addRegion(regionName, rect);
        stripper.extractRegions(page);
        extractedText = stripper.getTextForRegion(regionName);
        System.out.println("Region is " + extractedText);
    } catch (IOException e) {
        System.out.println("Could not read the PDF: " + e.getMessage());
    }
    // Return the extracted text so it can be used in an assertion.
    return extractedText;
}
Please suggest whether my approach is correct.