In comments the OP clarified that he locates the text value from the table in a pdf file he wants to extract
By providing X and Y co-ordinates
Thus, while the question initially sounded like generic extraction of tabular data from PDFs (which can be difficult at least), it actually is essentially about extracting the text from a rectangular region on a page given by coordinates.
This is possible using either of the libraries you mentioned (and surely others, too).
iText
To restrict the region from which you want to extract text, you can use the RegionTextRenderFilter
in a FilteredTextRenderListener
, e.g.:
/**
* Parses a specific area of a PDF to a plain text file.
* @param pdf the original PDF
* @param txt the resulting text
* @throws IOException
*/
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
Rectangle rect = new Rectangle(70, 80, 490, 580);
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();
}
(ExtractPageContentArea sample from iText in Action, 2nd edition)
Beware, though, iText extracts text based on the basic text chunks in the content stream, not based on each individual glyph in such a chunk. Thus, the whole chunk is processed if only the tiniest part of it is in the area.
This may or may not suit you.
If you run into the problem that more is extracted than you wanted, you should split the chunks into their constituting glyphs beforehand. This stackoverflow answer explains how to do that.
PDFBox
To restrict the region from which you want to extract text, you can use the PDFTextStripperByArea
, e.g.:
PDDocument document = PDDocument.load( args[0] );
if( document.isEncrypted() )
{
document.decrypt( "" );
}
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
Rectangle rect = new Rectangle( 10, 280, 275, 60 );
stripper.addRegion( "class1", rect );
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get( 0 );
stripper.extractRegions( firstPage );
System.out.println( "Text in the area:" + rect );
System.out.println( stripper.getTextForRegion( "class1" ) );
(ExtractTextByArea from the PDFBox 1.8.8 examples)