I am using Apache PDFBox 2.0 to parse a PDF file. Relying on some fixed strings, I was able to build a system based on:

  1. A fixed text, as a starting point
  2. The next cell/text position, or null
  3. The bottom area, to determine the height of the rectangle.

Using the starting point, I compute the x and y (see the picture below for how coordinates are mapped in PDFBox):

[image: PDF mapping in PDFBox]

Using the "next" text block (which is another fixed value, for example a field or a table header), I determine the width of the desired region using the formula:

width = second.x - first.x 

or something similar. So in a table, for example, knowing the header names in advance makes it easy to detect the columns.

What I am trying to do (and have so far failed to do accurately) is determine the lines (rows) of a PDF table. The table sometimes has missing values in some columns and also multi-line values in some rows/columns. I have extended my "system" (first, next, bottom) to work dynamically with table rows, and this works great for "normalized" tables (e.g. no blank cells and no multi-line values). It does not work with real-world data, though, because so far I could not find a way to determine the location (x, y, width, height) of a multi-line value.

Is this even possible with PDFBox? Some people suggested converting the PDF to HTML first and then parsing the HTML instead. Is this a viable option? Has anyone worked with this library? I will try to use this next.
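As a rough illustration of the "first, next, bottom" idea described above, here is a minimal sketch of how the rectangle could be computed. It assumes the anchor positions have already been collected (for example from `TextPosition.getXDirAdj()` / `getYDirAdj()`), that y grows downward from the top of the page, and that the page width is used when there is no "next" anchor. The class `AnchorRegion` and the method `between` are hypothetical names, not part of PDFBox.

```java
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not a PDFBox class.
final class AnchorRegion {

    /**
     * Builds the region delimited by three fixed texts.
     * first  : position of the starting text (gives x and y)
     * next   : position of the following fixed text, or null if there is none
     * bottom : position of a fixed text below the region (gives the height)
     * Assumption: y grows downward, and pageWidth is the fallback right edge.
     */
    static Rectangle2D.Double between(Point2D first, Point2D next,
                                      Point2D bottom, double pageWidth) {
        double x = first.getX();
        double y = first.getY();
        // width = second.x - first.x, or the rest of the page when next is null
        double right = (next != null) ? next.getX() : pageWidth;
        double width = right - x;
        double height = bottom.getY() - y;
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("anchors do not enclose an area");
        }
        return new Rectangle2D.Double(x, y, width, height);
    }
}
```

A rectangle built this way could then be handed to `PDFTextStripperByArea.addRegion(String, Rectangle2D)`, followed by `extractRegions(PDPage)` and `getTextForRegion(String)`, to read the text inside it.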

marc_s
hypercube
  • Parsing table data from PDF is non-trivial if you are just working with the text drawing operators. If you are lucky, the PDF is tagged and you can go off of the structure tree rather than the extracted text. Do you know if the PDFs you will be working with are tagged? – joelgeraci Aug 16 '18 at 00:02
  • I actually have the pdf file, but I don't know how to check if it's tagged or not. – hypercube Aug 16 '18 at 06:30
  • Call `document.getDocumentCatalog().getStructureTreeRoot()`. Is it null? Then it is not tagged. – Tilman Hausherr Aug 16 '18 at 07:51
  • I have evaluated the expression `document.getDocumentCatalog().getStructureTreeRoot()` and it's null. Therefore, I guess it's not tagged. – hypercube Aug 16 '18 at 12:22
  • More stuff that may or may not help: https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions , and the PrintTextLocations.java example. Also see the "Tabula" project. – Tilman Hausherr Aug 17 '18 at 07:47
  • I have good results using the current, next, bottom cells system, combined with the following: I assume that values from one column are distinct and occupy exactly one row in that column. Knowing this, I can parse the start vertical coordinate of each such value, which will be used later on to define the height of the selection (e.g. row height) for each line. – hypercube Aug 17 '18 at 14:11
  • Having a start and end vertical coordinate, and combining it with the start and end horizontal coordinates, I am now able to parse the cell contents, even if a value occupies multiple lines! This works because the lower vertical coordinate corresponds to one of the following: a) the vertical start position of the next row (which is the same as the position of the next distinct value determined earlier) or b) the vertical start position of the bottom PDF text position (e.g. a fixed string which comes after the table/current elements). – hypercube Aug 17 '18 at 14:11
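
The row detection described in the last two comments can be sketched as follows. This is only an illustration under stated assumptions: the start y of every value in the single-line "key" column has already been collected and sorted ascending (y growing downward), and `tableBottomY` is the y of the fixed text after the table. `RowBands`, `bands` and `rowOf` are hypothetical names, not PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
final class RowBands {

    /**
     * Turns the start y of each key-column value into row bands {top, bottom}.
     * A row ends where the next key value starts; the last row ends at the
     * fixed text below the table.
     */
    static List<double[]> bands(double[] keyYs, double tableBottomY) {
        List<double[]> rows = new ArrayList<>();
        for (int i = 0; i < keyYs.length; i++) {
            double top = keyYs[i];
            double bottom = (i + 1 < keyYs.length) ? keyYs[i + 1] : tableBottomY;
            rows.add(new double[] {top, bottom});
        }
        return rows;
    }

    /** Returns the index of the row whose band contains y, or -1 if none does. */
    static int rowOf(List<double[]> rows, double y) {
        for (int i = 0; i < rows.size(); i++) {
            double[] band = rows.get(i);
            if (y >= band[0] && y < band[1]) {
                return i;
            }
        }
        return -1;
    }
}
```

With key values starting at y = 100, 120 and 160 and the table ending at y = 200, a text line at y = 140 falls into the second row, which is how a second line of a multi-line cell ends up in the same row as its first line. In practice a small tolerance on the comparison may be needed, since text on the same visual line does not always report exactly the same y.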

1 Answer

Like I said in my previous comments, I have found a partial solution for my issue. This is based on two things:

  1. First, I assume that one column in each table contains only distinct values which never occupy more than one row.
  2. Next, since I also have some fixed texts in the document, I determine the coordinates of these texts and use them to delimit the area containing the text I want to extract. For example, the "current, next, bottom" system (as I call it) can consist of: "Column name A", "Column name B", "Fixed text C" (or the second row of the same table, determined from the unique single-row values).

It is not perfect, and problems can arise if a fixed text occurs more than once in the document. Improvements can of course be made, for example by picking the correct occurrence based on its vertical coordinate, but for the moment I will close this question: the problem seems to have no standard answer, and I have not found an open source library that reliably extracts this kind of tabular data from PDFs (I have not evaluated the Tabula project mentioned in the comments).
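
One possible way to pick the right occurrence of a repeated fixed text by its vertical coordinate is sketched below. It assumes all occurrences of the text have already been located, that y grows downward, and that the wanted occurrence is the first one strictly below a known position (for example the table header). `AnchorPicker` and `firstBelow` are hypothetical names, not PDFBox API.

```java
import java.awt.geom.Point2D;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
final class AnchorPicker {

    /**
     * Returns the occurrence with the smallest y that is still greater than
     * minY, i.e. the first occurrence below minY, or null if there is none.
     */
    static Point2D firstBelow(List<? extends Point2D> occurrences, double minY) {
        Point2D best = null;
        for (Point2D p : occurrences) {
            boolean below = p.getY() > minY;
            if (below && (best == null || p.getY() < best.getY())) {
                best = p;
            }
        }
        return best;
    }
}
```

For occurrences at y = 80, 250 and 400 and a header at y = 100, this selects the one at y = 250 as the "bottom" delimiter.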

hypercube