I'm using PDFBox version 1.7.0 to extract text from a PDF. The classes were compiled to .NET using IKVM.NET. I'm using the following code, where I pass in the path and name of the file:
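    using org.apache.pdfbox.pdmodel;
    using org.apache.pdfbox.util;

    public static string PDFText(string PDFFilePath)
    {
        PDDocument doc = PDDocument.load(PDFFilePath);
        try
        {
            // Extract the text of all pages with the default stripper settings.
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            doc.close(); // close the document even if extraction throws
        }
    }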
The PDF has two columns throughout. The extraction works fairly well. However, many words are split across lines with a hyphen when they should be preserved as complete words.
For example, the word "becoming" changes to "becom-
ing" as do many other words.
Is there a way to prevent PDFBox from randomly splitting a word with a dash "-" or hyphen and displaying part of the word on one line while carrying the rest of it to the next line?
I saw a question on Stack Overflow that dealt with PDFBox randomly inserting white space within words.
However, my issue is PDFBox splitting words with dashes or hyphens.
I also saw a reference to a method called charactersByArticle, which was expressly for two-column PDFs, and I thought perhaps this might render the extracted text correctly. However, I have not found a working example of how to use this method, just teaser references to it.
If the charactersByArticle method wouldn't prevent this, I would even consider a Regex if someone could provide a good working example of using one in conjunction with my PDFText method above. Thank you in advance.
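For reference, the Regex post-processing I have in mind would look roughly like the sketch below. It is only a minimal sketch under assumptions: the class name Dehyphenator and the method name JoinHyphenatedLineBreaks are hypothetical names I made up, it uses only System.Text.RegularExpressions and no PDFBox API, and it assumes that a hyphen at the end of a line followed by a lowercase letter on the next line is always a split word. That assumption also merges real compounds that happen to break at their hyphen (for example "two-" / "column"), which is why I would prefer a PDFBox-side solution if one exists.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFBox: rejoins words that were
// hyphenated across a line break in already-extracted text.
public static class Dehyphenator
{
    // A letter, a hyphen, optional spaces/tabs, a line break,
    // optional spaces/tabs, then a lowercase letter.
    private static readonly Regex HyphenatedBreak =
        new Regex(@"(\p{L})-[ \t]*\r?\n[ \t]*(\p{Ll})", RegexOptions.Compiled);

    public static string JoinHyphenatedLineBreaks(string text)
    {
        // Keep the two letters, drop the hyphen and the line break between them.
        return HyphenatedBreak.Replace(text, "$1$2");
    }
}
```

It would be applied to the string returned by my method, e.g. `string text = Dehyphenator.JoinHyphenatedLineBreaks(PDFText(path));`, so that "becom-" followed by "ing" on the next line comes back as "becoming". The rest of the second line then moves up onto the first line, which is fine for my purposes.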