I'm trying to extract text from a PDF document using iTextSharp. The text I'm interested in appears beneath the "Introduction" header in the example below:
I have several hundred PDF documents that contain this "Introduction" page, normally on page five or six of the document. The paragraph always begins with a drop cap (an enlarged initial letter), such as the large P in "Physical" in the example.
In the following code, I scan the document for a page that begins with the text "Introduction", then append the text of each page until I reach a page that begins with the next heading ("Chapter 1"):
    private static string GetIntroductionText( string filePath )
    {
        using ( var reader = new PdfReader( filePath ) )
        {
            var appending = false;
            var introText = new StringBuilder();
            for ( var i = 1; i <= reader.NumberOfPages; i++ )
            {
                var pageText = PdfTextExtractor.GetTextFromPage( reader, i );
                if ( pageText.Trim().StartsWith( "Introduction" ) )
                {
                    appending = true;
                }
                if ( pageText.Trim().StartsWith( "Chapter" ) )
                {
                    break;
                }
                if ( appending )
                {
                    introText.Append( pageText );
                }
            }
            return introText.ToString();
        }
    }
The problem is that the drop cap, i.e. the P in "Physical", is missing from the extracted text, which comes out as:
hysical reality is consistent with universal laws. Where the laws do not operate, there is no reality. All of this...is unreal.
How do I get the initial at the beginning of the text?
I thought it might involve using the LocationTextExtractionStrategy, like so:
    var pageText = PdfTextExtractor.GetTextFromPage( reader, i, new LocationTextExtractionStrategy() );
Unfortunately, this produced the same result.