(This originally was the answer (dated Feb 6 '15) to another question which the OP deleted including all answers. Due to the age, the code in the answer was still based on PDFBox 1.8.x, so some changes might be necessary to make it run with PDFBox 2.0.x.)
In comments the OP showed interest in a solution to extend the PDFBox PDFTextStripper
to return text lines which attempt to reflect the PDF file layout which might help in case of the question at hand.
A proof-of-concept for that would be this class:
public class LayoutTextStripper extends PDFTextStripper
{
public LayoutTextStripper() throws IOException
{
super();
}
@Override
protected void startPage(PDPage page) throws IOException
{
super.startPage(page);
cropBox = page.findCropBox();
pageLeft = cropBox.getLowerLeftX();
beginLine();
}
@Override
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
float recentEnd = 0;
for (TextPosition textPosition: textPositions)
{
String textHere = textPosition.getCharacter();
if (textHere.trim().length() == 0)
continue;
float start = textPosition.getTextPos().getXPosition();
boolean spacePresent = endsWithWS | textHere.startsWith(" ");
if (needsWS | spacePresent | Math.abs(start - recentEnd) > 1)
{
int spacesToInsert = insertSpaces(chars, start, needsWS & !spacePresent);
for (; spacesToInsert > 0; spacesToInsert--)
{
writeString(" ");
chars++;
}
}
writeString(textHere);
chars += textHere.length();
needsWS = false;
endsWithWS = textHere.endsWith(" ");
try
{
recentEnd = getEndX(textPosition);
}
catch (IllegalArgumentException | IllegalAccessException | NoSuchFieldException | SecurityException e)
{
throw new IOException("Failure retrieving endX of TextPosition", e);
}
}
}
@Override
protected void writeLineSeparator() throws IOException
{
super.writeLineSeparator();
beginLine();
}
@Override
protected void writeWordSeparator() throws IOException
{
needsWS = true;
}
void beginLine()
{
endsWithWS = true;
needsWS = false;
chars = 0;
}
int insertSpaces(int charsInLineAlready, float chunkStart, boolean spaceRequired)
{
int indexNow = charsInLineAlready;
int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
int spacesToInsert = indexToBe - indexNow;
if (spacesToInsert < 1 && spaceRequired)
spacesToInsert = 1;
return spacesToInsert;
}
float getEndX(TextPosition textPosition) throws IllegalArgumentException, IllegalAccessException, NoSuchFieldException, SecurityException
{
Field field = textPosition.getClass().getDeclaredField("endX");
field.setAccessible(true);
return field.getFloat(textPosition);
}
public float fixedCharWidth = 3;
boolean endsWithWS = true;
boolean needsWS = false;
int chars = 0;
PDRectangle cropBox = null;
float pageLeft = 0;
}
It is used like this:
PDDocument document = PDDocument.load(PDF);
LayoutTextStripper stripper = new LayoutTextStripper();
stripper.setSortByPosition(true);
stripper.fixedCharWidth = charWidth; // e.g. 5
String text = stripper.getText(document);
fixedCharWidth
is the assumed character width. Depending on the writing in the PDF in question a different value might be more apropos. In my sample documents values from 3..6 were of interest.
It essentially emulates the analogous solution for iText in this answer. Results differ a bit, though, as iText text extraction forwards text chunks and PDFBox text extraction forwards individual characters.
Please be aware that this is merely a proof-of-concept. It especially does not take any rotation into account