
I am using iTextSharp to extract data from PDFs. I stumbled across the following problem, illustrated by the scenario below:

I created a sample Excel file to illustrate. Here is what it looks like: [screenshot of the sample Excel table]

I convert it to a PDF using one of the many free online converters available out there, which generates a PDF that looks like this (when I generated the PDF I did not apply the styling to the Excel file): [screenshot of the generated PDF]

Now, extracting the data from the PDF with iTextSharp returns the following string:

[screenshot of the text extracted by iTextSharp]

As you can see, wrapped cell data generates new lines, with each wrapped piece of data separated by a single white space.

The problem: how does one now identify to which column a given piece of wrapped data belongs? If only iTextSharp preserved as many white spaces as there are columns...

In my example, how can I identify to which column 111 belongs?


Update 1:

A similar problem occurs whenever a field has more than one word (i.e., contains white spaces). For example, say the first line of the sample above looked like this:

---A---  ---B---  ---C---  ---D---
aaaaaaa    bb b     cccc      

iText would again extract this as:

aaaaaaa bb b cccc

The same problem arises here: determining the borders of each column.


Update 2: A sample of the real PDF file I am working with, showing what the PDF data looks like: [screenshot of the real PDF]

Veverke
  • If I hadn't seen the original tables with lines, I wouldn't have any idea which entries of the table without lines belong together and which don't. Thus, unless the *one of the many free online converters available out there* you used left some extra information in the file, I doubt you can properly solve your problem. If you share the PDF in question, people here can check whether such additional information is in it. – mkl Dec 31 '15 at 15:59
  • My problem is not the online converter of choice (it entered the picture solely to create the sample here). The PDF shows text lines just like the ones displayed above, and I am to extract the text, saving it as an Excel file. My problem is having to determine column borders from the data record itself, as well as having to figure out to which column a wrapped text belongs, since `iTextSharp` extracts it into a new separate line. – Veverke Dec 31 '15 at 16:42
  • Please share a representative sample file. – mkl Dec 31 '15 at 18:01

3 Answers


In addition to Chris' generic answer, here is some background on iText(Sharp) content parsing...

iText(Sharp) provides a framework for content extraction in the namespace iTextSharp.text.pdf.parser / package com.itextpdf.text.pdf.parser. This framework reads the page content, keeps track of the current graphics state, and forwards information on pieces of content to the IExtRenderListener or IRenderListener / ExtRenderListener or RenderListener the user (i.e. you) provides. In particular, it does not interpret any structure into this information.

This render listener may be a text extraction strategy (ITextExtractionStrategy / TextExtractionStrategy), i.e. a special render listener which is predominantly designed to extract a pure text stream without formatting or layout information. And for this special case iText(Sharp) additionally provides two sample implementations, the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy.
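For orientation, this is the standard way to plug in one of these built-in strategies (iTextSharp 5.x API; the file name is made up):

    using System;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    // Extract the text of page 1 using the location-aware sample strategy
    using (var reader = new PdfReader("sample.pdf"))
    {
        string text = PdfTextExtractor.GetTextFromPage(
            reader, 1, new LocationTextExtractionStrategy());
        Console.WriteLine(text);
    }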

For your task you need a more sophisticated render listener which either

  • exports the text with coordinates (Chris in one of his answers has provided an extended LocationTextExtractionStrategy which can additionally provide positions and bounding boxes of text chunks), allowing you to analyse tabular structures in additional code (a minimal sketch of this idea follows after this list); or
  • does the analysis of tabular data itself.
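To illustrate the former variant, here is a minimal sketch of a strategy that records each text chunk together with its baseline coordinates. The class and its members are my own invention, not part of iTextSharp; only `TextRenderInfo`, `Vector`, and the interface members are actual iTextSharp 5.x API:

    using System.Collections.Generic;
    using System.Linq;
    using iTextSharp.text.pdf.parser;

    public class ChunkWithCoordsStrategy : ITextExtractionStrategy
    {
        public class Chunk
        {
            public string Text;
            public float X;
            public float Y;
        }

        public readonly List<Chunk> Chunks = new List<Chunk>();

        public void RenderText(TextRenderInfo renderInfo)
        {
            // Record the chunk text and the start point of its baseline
            Vector start = renderInfo.GetBaseline().GetStartPoint();
            Chunks.Add(new Chunk
            {
                Text = renderInfo.GetText(),
                X = start[Vector.I1],
                Y = start[Vector.I2]
            });
        }

        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo renderInfo) { }

        public string GetResultantText()
        {
            // Crude fallback; the interesting output is the Chunks list
            return string.Join(" ", Chunks.Select(c => c.Text));
        }
    }

Feeding an instance of this to PdfTextExtractor.GetTextFromPage fills Chunks with positioned text pieces which additional code of yours can then cluster into rows and columns.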

I do not have an example for the latter variant because generically recognizing and parsing tables is a whole project in itself. You might want to look into the Tabula project for inspiration; this project is surprisingly good at the task of table extraction.

PS: If you feel more at home with trying to extract structured content from a pure string representation of the content which nonetheless tries to reflect the original layout, you might try something like what is proposed in this answer, a variant of the LocationTextExtractionStrategy working similarly to the pdftotext -layout tool; only the changes to be applied to the LocationTextExtractionStrategy are shown there.

PPS: Extraction of data from very specific PDF tables may be much easier; for example, have a look at this answer, which demonstrates that after some PDF analysis the specific way a given table is created might give rise to a simple custom render listener for extracting the table data. This can make sense for a single PDF with a table spanning many, many pages, as in the case of that answer, or if you have many PDFs identically created by the same software.

This is why I asked for a representative sample file in a comment to your question.


Concerning your comments

Still with the pdf example above, both with an implementation from scratch of ITextExtractionStrategy and with extending LocationExtractionStrategy, I see that each RenderText is called at the following chunks: Fi, el, d, A, Fi, el, d... and so on. Can this be changed?

The chunks of text you get as separate RenderText calls are not separated by accident or some random decision of iText. They are the very strings drawn separately in the page content!

In your sample "Fi", "el", "d", and "A" come in different RenderText calls because the content stream contains operations in which first "Fi" is drawn, then "el", then "d", then "A".

This may sound weird at first. A common cause for such torn-up words is that PDF does not use the kerning information from fonts; to apply kerning, therefore, the PDF-generating software has to insert tiny forward or backward jumps between characters which should be farther from or nearer to each other than without kerning. Thus, words often are torn apart between kerning pairs.

So this cannot be changed; you will get those pieces, and it is the job of the text extraction strategy to put them together.

By the way, there are worse PDFs; some PDF generators position each and every glyph separately, foremost generators which primarily build GUIs but can, as a feature, automatically export GUI canvases as PDFs.

I would expect that in entering the realm of "adding my own implementation" I would have control over how to determine what is a "chunk" of text.

You can... well, you have to decide which of the incoming pieces belong together and which don't. E.g. do glyphs with the same y coordinate form a single line? Or do they form separate lines in different columns which just happen to be located next to each other?

So yes, you decide which glyphs you interpret as a single word or as content of a single table cell, but your input consists of the groups of glyphs used in the actual PDF content stream.
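As an illustration only, reusing the hypothetical ChunkWithCoordsStrategy sketched earlier in this answer, a naive interpretation might cluster chunks by their baseline y coordinate; whether that is the right call for your documents is exactly the decision discussed above:

    // Assumes a ChunkWithCoordsStrategy instance named strategy that has
    // already processed a page; requires System.Linq.
    var lines = strategy.Chunks
        .GroupBy(c => (int)Math.Round(c.Y))   // same (rounded) baseline ~ same line
        .OrderByDescending(g => g.Key)        // PDF y coordinates grow upwards
        .Select(g => string.Concat(g.OrderBy(c => c.X).Select(c => c.Text)));

    foreach (var line in lines)
        Console.WriteLine(line);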

Not only that, in none of the interface's methods I can "spot" how/where it deals with non-text data/images - so I could intercede with the spacing issue (RenderImage is not called)

RenderImage will be called for embedded bitmap images, JPEGs etc. If you want to be informed about vector graphics, your strategy will also have to implement IExtRenderListener which provides methods ModifyPath, RenderPath and ClipPath.
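A rough sketch of such a listener that collects straight line segments (in many PDFs these are exactly the table rulings) might look as follows. The class is my own invention; I am writing against the iTextSharp 5.5.x parser API, and the exact IExtRenderListener member signatures have varied between releases, so treat the details as assumptions to verify against your version:

    using System.Collections.Generic;
    using iTextSharp.text.pdf.parser;

    public class RulingCollector : IExtRenderListener
    {
        // Each entry holds the transformed start and end point of one segment
        public readonly List<Vector[]> Segments = new List<Vector[]>();
        private Vector current;

        public void ModifyPath(PathConstructionRenderInfo renderInfo)
        {
            IList<float> d = renderInfo.SegmentData;
            switch (renderInfo.Operation)
            {
                case PathConstructionRenderInfo.MOVETO:
                    current = new Vector(d[0], d[1], 1).Cross(renderInfo.Ctm);
                    break;
                case PathConstructionRenderInfo.LINETO:
                    Vector to = new Vector(d[0], d[1], 1).Cross(renderInfo.Ctm);
                    Segments.Add(new[] { current, to });
                    current = to;
                    break;
                // Curves, rectangles and subpath closes are ignored in this sketch
            }
        }

        public Path RenderPath(PathPaintingRenderInfo renderInfo) { return null; }
        public void ClipPath(int rule) { }
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderText(TextRenderInfo renderInfo) { }
        public void RenderImage(ImageRenderInfo renderInfo) { }
    }

Processing a page with new PdfReaderContentParser(reader).ProcessContent(page, new RulingCollector()) then fills Segments; its horizontal and vertical entries are candidates for row and column borders.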

mkl
  • Thank you as well for all these highly informative additions. +1 – Veverke Jan 03 '16 at 08:48
  • Added a sample of the file as you asked. – Veverke Jan 03 '16 at 09:09
  • *Added a sample of the file as you asked.* - I'd need the PDF, not a screenshot, to analyze. – mkl Jan 03 '16 at 09:39
  • As you can imagine, I am reluctant to do that due to data disclosure concerns. Will check this option, though, and let you know. – Veverke Jan 03 '16 at 09:57
  • Ok. If you have access to the original PDF creation process, you might try and feed it dummy data and share the result PDF. Do not change the PDF after creation, though, as that would change the structures I try to find. – mkl Jan 03 '16 at 10:04
  • I am having trouble trying to create my own implementation of ITextExtractionStrategy. If I simply create a class that inherits ITextExtractionStrategy and do nothing more than add all the interface methods (all of them throwing NotImplementedException), why, upon putting a breakpoint within each method, do I not reach any?! (I changed, of course, the strategy object my extraction is instantiating - from Location to the new class aforementioned.) – Veverke Jan 12 '16 at 09:06
  • I did not get back into this right away after I marked your reply as an answer; I am trying to get back into it now, thus the delay. – Veverke Jan 12 '16 at 09:08
  • *why upon putting a breakpoint in within each method I do not reach any?* - usually this works without problem, so please provide enough additional information to allow reproducing the issue. As this essentially is a question with a different focus, please do so in a new stackoverflow question. – mkl Jan 12 '16 at 10:11
  • For some reason my "demos" project breaks somehow. Started a new project for the sole purpose of playing with implementing the interface and now I reach the breakpoints. Will keep annoying you in case I cannot help myself out there :) – Veverke Jan 13 '16 at 10:14
  • Still with the pdf example above, both with an implementation from scratch of ITextExtractionStrategy and with extending LocationExtractionStrategy, I see that each RenderText is called at the following chunks: Fi, el, d, A, Fi, el, d... and so on. Can this be changed? I would expect that in entering the realm of "adding my own implementation" I would have control over how to determine what is a "chunk" of text. Not only that, in none of the interface's methods I can "spot" how/where it deals with non-text data/images - so I could intercede with the spacing issue (`RenderImage` is not called). – Veverke Jan 13 '16 at 10:36
  • I have been having a hard time finding concrete and specific examples about all this, but seems I should be following Strategy #2 from [here](http://www.schiffhauer.com/read-text-in-a-pdf-in-c-with-itextsharp/) – Veverke Jan 13 '16 at 10:49
  • *but seems I should be following Strategy #2 from [here](http://www.schiffhauer.com/read-text-in-a-pdf-in-c-with-itextsharp/)* - well, if you know the positions and sizes of your table cells beforehand, yes, you can use that approach. But I had the impression you did *not* know them. – mkl Jan 13 '16 at 12:21
  • I do not know, but I can perhaps estimate it. It's the best approach I came across so far. What about extracting text along with extracting non-text from a given rectangle and trying to determine how many white spaces I have there? Sounds good? How/where do I have an ImageRenderFilter for that? – Veverke Jan 13 '16 at 12:53
  • (by the way I went over many of your other answers about iTextSharp and upvoted them both because they strive for completeness/are not superficial and because you have been helping me all the way through) – Veverke Jan 13 '16 at 13:12
  • *What about extracting text along extracting non-text from a given rectangle and trying to determine how many white spaces I have there?* - I do not completely understand what you have in mind there. Please be aware that any white areas you see might be full of space character glyphs or it might contain none; there may be white rectangles or you simply see the white backdrop. There even might be some old content covered by a white rectangle on which new content is drawn, and tabular structures in the old and the new content may differ. Your render listener gets all those objects... – mkl Jan 13 '16 at 13:48
  • ... Thus a generic table extracting solution is non-trivial to implement, as mentioned earlier in my answer that is a project in its own right, not merely a side-note of PDF processing. But do try your idea, the philosopher's stone of table extraction has not yet been found, new ideas may prove to be better than current implementations. – mkl Jan 13 '16 at 13:54
  • *(by the way I went over many of your other answers about iTextSharp and upvoted them both because they strive for completeness/are not superficial and because you have been helping me all the way through)* - Thanks. But please be aware that stackoverflow software does not like massive upvoting of one single account to another in a short time, I'm afraid such upvotes will be automatically revoked... – mkl Jan 14 '16 at 09:41
  • Whenever you come across an answer helping you, you're welcome to vote ;). But please do not blindly upvote. I'm pretty much aware that I've written some good and some not so good answers... – mkl Jan 14 '16 at 14:10

This isn't really an answer but I needed a spot to show some things that might help you understand the problem.

First "conversion" from Excel, Word, PowerPoint, HTML or whatever to PDF is almost always going to be a destructive change. The destructive part is very important and it happens because you are taking data from a program that has very specific knowledge of what that data represents (Excel) and you are turning it into drawing commands in a very generic universal format (PDF) that only cares about what the data looks like, not the data itself. Unless the data is "tagged" (and it almost never is these days still) then there is no context for the drawing commands. There are no paragraphs, there are no sentences, there are no columns, rows, tables, etc. There's literally just draw this letter at x,y and draw this word at a,b.

Second, imagine your Excel file had the following data and for some reason the last column was narrower than the others when the PDF was made:

Column A | Column B | Column 
                      C
Data #1    Data #2    Data
                      #3

You and I have context so we know that the second and fourth lines are really just the continuation of the first and third lines. But since iText doesn't have any context during extraction it doesn't think like that and it sees four lines of text. In fact, since it doesn't have context it doesn't even see columns, just the lines themselves.

Third, although it is a very small thing, you need to understand that you don't draw spaces in PDF. Imagine the three-column table below:

Column A | Column B | Column C
                      Yes

If you extracted that from a PDF you'd get this data:

Column A | Column B | Column C
Yes

Inside the PDF the word "Yes" will just be drawn at a certain x coordinate that you and I consider to be under the third column, and it won't have a bunch of spaces in front of it.
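Which means that assigning text to columns comes down to x coordinates. A hedged illustration: if you know or can estimate the column borders, a chunk like "Yes" is mapped to its column with a simple lookup (the border values below are invented):

    // Assumed left borders of columns A, B and C, e.g. measured from a sample page
    static readonly float[] ColumnLefts = { 50f, 200f, 350f };

    static int ColumnOf(float x)
    {
        // Pick the rightmost column whose left border lies at or left of x
        for (int i = ColumnLefts.Length - 1; i >= 0; i--)
            if (x >= ColumnLefts[i])
                return i;
        return 0;
    }

    // A "Yes" drawn at x = 370 yields ColumnOf(370f) == 2, i.e. Column C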

As I said at the beginning, this isn't much of an answer, but hopefully it will explain the problem that you are trying to solve. If your PDF is tagged then it will have context and you can use that context during extraction. Context isn't universal, however, so there usually isn't just a magic "insert context" checkbox. Excel actually does have a checkbox (if I remember correctly) to make a tagged PDF during export, and it ultimately creates a tagged PDF using HTML-like tags for tables. Very primitive, but it works. However, it will be up to you to parse this context.

Chris Haas

Leaving here an alternative strategy for extracting the data. It does not solve the problem of how spaces are treated/can be treated, but it gives you somewhat more control over the extraction by letting you specify geometric areas you want to extract text from. Taken from [here](http://www.schiffhauer.com/read-text-in-a-pdf-in-c-with-itextsharp/).

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    public static System.util.RectangleJ GetRectangle(
        float distanceInPixelsFromLeft,
        float distanceInPixelsFromBottom,
        float width,
        float height)
    {
        return new System.util.RectangleJ(
            distanceInPixelsFromLeft,
            distanceInPixelsFromBottom,
            width,
            height);
    }

    public static void Strategy2()
    {
        // Only capture text from the page of interest
        int pageNumber = 1;

        var result = new List<Tuple<string, int>>();

        // The PdfReader object implements IDisposable, so you can wrap it
        // in a using block to dispose of it automatically
        using (var pdfReader = new PdfReader("D:/Example.pdf"))
        {
            float distanceInPixelsFromLeft = 20;
            float width = 300;
            float height = 10;

            // Scan the page top-down in horizontal strips 10 pixels high
            for (int i = 800; i >= 0; i -= 10)
            {
                var rect = GetRectangle(distanceInPixelsFromLeft, i, width, height);

                // Restrict extraction to the current strip
                var filters = new RenderFilter[] { new RegionTextRenderFilter(rect) };

                ITextExtractionStrategy strategy =
                    new FilteredTextRenderListener(
                        new LocationTextExtractionStrategy(),
                        filters);

                var currentText = PdfTextExtractor.GetTextFromPage(
                    pdfReader,
                    pageNumber,
                    strategy);

                // Re-encode the strip's text from the default encoding to UTF-8
                currentText =
                    Encoding.UTF8.GetString(Encoding.Convert(
                        Encoding.Default,
                        Encoding.UTF8,
                        Encoding.Default.GetBytes(currentText)));

                result.Add(new Tuple<string, int>(currentText, currentText.Length));
            }
        }

        // Do something with the strips; here they are written to a console window
        foreach (var line in result.Distinct().Where(r => !string.IsNullOrWhiteSpace(r.Item1)))
        {
            Console.WriteLine("Text: [{0}], Length: {1}", line.Item1, line.Item2);
        }
    }

Outputs:

[screenshot of the console output: the extracted strips and their lengths]

PS: We are still left with the problem of how to deal with spaces/non-text data.

Veverke