I'm trying to extract text from a PDF document using iTextSharp. The text I'm interested in appears beneath the "Introduction" header in the example below:

[Image: example page with the "Introduction" header]

I have several hundred PDF documents that contain this "Introduction" page, normally on page five or six of the document. The paragraph always begins with an initial, such as the large P in "Physical" in the example.

In the following code, I scan the document for a page that begins with the text "Introduction," then I extract the text until the next heading ("Chapter 1"):

// Requires: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
private static string GetIntroductionText( string filePath )
{
    using ( var reader = new PdfReader( filePath ) )
    {
        var appending = false;
        var introText = new StringBuilder();

        for ( var i = 1; i <= reader.NumberOfPages; i++ )
        {
            var pageText = PdfTextExtractor.GetTextFromPage( reader, i );

            if ( pageText.Trim().StartsWith( "Introduction" ) )
            {
                appending = true;
            }

            if ( pageText.Trim().StartsWith( "Chapter" ) )
            {
                break;
            }

            if ( appending )
            {
                introText.Append( pageText );
            }
        }

        return introText.ToString();
    }
}

The problem is that it doesn't extract the initial, i.e. the P in "Physical". So the text is:

hysical reality is consistent with universal laws. Where the laws do not operate, there is no reality. All of this...is unreal.

How do I get the initial at the beginning of the text?

I thought it might involve using the LocationTextExtractionStrategy like so:

var pageText = PdfTextExtractor.GetTextFromPage( reader, i, new LocationTextExtractionStrategy() );

Unfortunately this produced the same result.

Big McLargeHuge
  • You need to write your own `LocationTextExtractionStrategy` to solve this problem. The default extraction strategy used by iText reorders content snippets based on the *baseline* of each text snippet. As the baseline of the first capital is much lower than the baseline of the text in the first sentence, it will be ordered "later" than the rest of the text. (However, if your PDF is a Tagged PDF, you could use the [TaggedPdfReaderTool](http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TaggedPdfReaderTool.html), in which case iText will look at the semantic structure of the text.) – Bruno Lowagie Jan 10 '15 at 10:13
  • Is the initial drawn as text or is it an image? If you are not sure, please share a sample document. – mkl Jan 10 '15 at 10:32
  • @BrunoLowagie do you know where I can find an example of such a thing? – Big McLargeHuge Jan 12 '15 at 17:33
  • The source code of iText is open. Take a look inside. We have done a similar project for a customer, but that's closed source. I am not allowed to share that code. (Surely you understand why.) – Bruno Lowagie Jan 12 '15 at 17:36
  • Would you consider other libraries, or must it be done with iText only? – Hugo Moreno Jan 13 '15 at 04:59
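
The baseline ordering described in the first comment can be illustrated with a small sketch. This is not iText code: `Snippet`, `DropCapDemo`, and the coordinates are hypothetical, made up only to mimic a drop cap whose baseline sits well below the first line of body text. The origin is assumed to be at the bottom-left, so a larger y is higher on the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a text snippet; not an iTextSharp type.
public sealed class Snippet
{
    public string Text;
    public float Left;      // x of the snippet's start
    public float Baseline;  // y of the baseline (larger = higher on the page)
    public float Top;       // y of the ascent line
}

public static class DropCapDemo
{
    // Made-up coordinates: a tall drop cap next to two lines of body text.
    public static List<Snippet> SamplePage()
    {
        return new List<Snippet>
        {
            new Snippet { Text = "P", Left = 72f, Baseline = 640f, Top = 700f },
            new Snippet { Text = "hysical reality ", Left = 110f, Baseline = 688f, Top = 700f },
            new Snippet { Text = "is consistent", Left = 110f, Baseline = 664f, Top = 676f },
        };
    }

    // Order by baseline (highest first), then left to right.
    public static string ByBaseline( IEnumerable<Snippet> snippets )
    {
        return string.Concat( snippets
            .OrderByDescending( s => s.Baseline )
            .ThenBy( s => s.Left )
            .Select( s => s.Text ) );
    }

    // Order by the top of each snippet (highest first), then left to right.
    public static string ByTop( IEnumerable<Snippet> snippets )
    {
        return string.Concat( snippets
            .OrderByDescending( s => s.Top )
            .ThenBy( s => s.Left )
            .Select( s => s.Text ) );
    }
}
```

With these sample values, `ByBaseline` places the "P" after both body lines, while `ByTop` keeps it in front of "hysical". Real documents will have different geometry, so treat this only as an illustration of why the sort key matters.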

1 Answer

For the record, here's how I solved this after looking at the iText source code (specifically the LocationTextExtractionStrategy class). Keep in mind that the (0, 0) coordinate is at the bottom-left of the page, not the top-left.

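// Requires: using System; using System.Collections.Generic; using System.Text;
//           using iTextSharp.text.pdf.parser;
public class ChunkExtractionStrategy : ITextExtractionStrategy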
{
    public List<Chunk> Chunks = new List<Chunk>();

    public void BeginTextBlock()
    {}

    public void EndTextBlock()
    {}

    public string GetResultantText()
    {
        var text = new StringBuilder();

        Chunks.Sort();

        Chunk prevChunk = null;

        foreach ( var chunk in Chunks )
        {
            if ( prevChunk == null && string.IsNullOrWhiteSpace( chunk.Text ) )
            {
                // blank space at beginning of page
                continue;
            }

            if ( prevChunk != null && !chunk.SameLine( prevChunk, 20 ) )
            {
                text.Append( "\n\n" );
            }

            text.Append( chunk.Text );

            prevChunk = chunk;
        }

        return text.ToString();
    }

    public void RenderImage( ImageRenderInfo renderInfo )
    {}

    public void RenderText( TextRenderInfo renderInfo )
    {
        Chunks.Add( new Chunk
                        {
                            TopLeft = renderInfo.GetAscentLine().GetStartPoint(),
                            BottomRight = renderInfo.GetDescentLine().GetEndPoint(),
                            Text = renderInfo.GetText(),
                        } );
    }

    public class Chunk : IComparable<Chunk>
    {
        public Vector TopLeft { get; set; }

        public Vector BottomRight { get; set; }

        public string Text { get; set; }

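        // Sort top-to-bottom, then left-to-right. PDF y grows upward,
        // so the chunk with the larger y must sort first.
        public int CompareTo( Chunk other )
        {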
            var y1 = (int)Math.Round( TopLeft[1] );
            var y2 = (int)Math.Round( other.TopLeft[1] );

            if ( y1 < y2 )
            {
                return 1;
            }

            if ( y1 > y2 )
            {
                return -1;
            }

            var x1 = (int)Math.Round( TopLeft[0] );
            var x2 = (int)Math.Round( other.TopLeft[0] );

            if ( x1 < x2 )
            {
                return -1;
            }

            if ( x1 > x2 )
            {
                return 1;
            }

            return 0;
        }

        public bool SameLine( Chunk other, int maxDiff = 0 )
        {
            var diff = Math.Abs( TopLeft[1] - other.TopLeft[1] );

            return diff <= maxDiff;
        }
    }
}

At first, I tried something similar to this answer. But then I found myself overriding everything in the class, so it made more sense to create a new implementation.

Big McLargeHuge
  • While looking at the top left is a good idea, I wouldn't use a constant maxDiff value but one dependent on the font sizes in question. – mkl Jan 13 '15 at 19:29
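
One way to follow the suggestion in the comment above is to derive the tolerance from the chunk heights instead of a fixed 20 units. The sketch below is only an assumption about how that might look: `Box` and `LineGrouping` are hypothetical stand-ins (not part of the answer or of iTextSharp), the ascent-to-descent height is used as a rough proxy for font size, and the 0.5 factor is an arbitrary starting point to tune.

```csharp
using System;

// Hypothetical stand-in for the answer's Chunk, holding only the y-coordinates used here.
public sealed class Box
{
    public float Top;     // y of the ascent line (TopLeft[1] in the answer)
    public float Bottom;  // y of the descent line (BottomRight[1] in the answer)

    // Rough proxy for the font size of the chunk.
    public float Height { get { return Top - Bottom; } }
}

public static class LineGrouping
{
    // Two boxes count as the same line when their tops differ by no more than
    // a fraction of the smaller box's height. The factor is an assumption.
    public static bool SameLine( Box a, Box b, float factor = 0.5f )
    {
        var tolerance = factor * Math.Min( a.Height, b.Height );
        return Math.Abs( a.Top - b.Top ) <= tolerance;
    }
}
```

Using the smaller of the two heights keeps a tall drop cap from widening the tolerance so far that it absorbs the lines below it. In the answer's `Chunk`, the same idea would compare `TopLeft[1]` values against a tolerance computed from `TopLeft[1] - BottomRight[1]`.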