I am using Apache PDFBox 2.0 to parse a PDF file. Using some fixed strings, I was able to build a system based on:
- A fixed text, as a starting point
- The next cell/text position, or null
- The bottom area, to determine the height of the rectangle.
Using the starting point, I compute the x and y (see the picture below for the PDF structure in PDFBox):
Using the "next" text block (which is another fixed value, for example a field or a table header), I determine the width of the desired region using the formula:
width = second.x - first.x
or something similar. So, in a table, for example, when the header names are known in advance, it's easy to detect the columns.

What I am trying to do (and so far failing to do accurately) is to determine the rows in a PDF table. This table sometimes has missing values in some columns, and some rows/columns contain multi-line values. I have extended my "system" (first, next, bottom) to work dynamically with table rows, and this works great on "normalized" tables (e.g. no whitespace and/or, at least, no multi-line values). But it does not work with real-world data, because so far I have not found a way to determine the location (x, y, width, height) of a multi-line value.

Is this even possible with PDFBox? Some people suggested converting the PDF to HTML first and then parsing the HTML instead. Is this a viable option? Has anyone worked with this library? I will try to use this next.
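To make the idea concrete, here is a minimal sketch in plain Java of the (first, next, bottom) region and of one possible way to get the box of a multi-line value. It is not PDFBox code and makes several assumptions: the `Fragment` class, `TableRegions` and all method names are hypothetical; the fragments are assumed to have been extracted already, with y growing downward; a missing "next" anchor falls back to an assumed page right edge; and one "key" column is assumed to hold exactly one single-line value per row, so its y positions can serve as row starts.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TableRegions {
    /** Hypothetical holder for one extracted text fragment and its box (y grows downward). */
    static final class Fragment {
        final String text;
        final Rectangle2D.Double box;

        Fragment(String text, double x, double y, double width, double height) {
            this.text = text;
            this.box = new Rectangle2D.Double(x, y, width, height);
        }
    }

    /** Region from the (first, next, bottom) anchors; a null nextX falls back to pageRight (assumption). */
    static Rectangle2D.Double region(double firstX, double firstY, Double nextX,
                                     double bottomY, double pageRight) {
        double right = (nextX != null) ? nextX : pageRight;
        // width = second.x - first.x, height = bottom.y - first.y
        return new Rectangle2D.Double(firstX, firstY, right - firstX, bottomY - firstY);
    }

    /** Row starts: the sorted y of every fragment whose top-left corner lies in the key column. */
    static List<Double> rowTops(List<Fragment> fragments, Rectangle2D keyColumn) {
        List<Double> tops = new ArrayList<>();
        for (Fragment f : fragments) {
            if (keyColumn.contains(f.box.x, f.box.y)) {
                tops.add(f.box.y);
            }
        }
        Collections.sort(tops);
        return tops;
    }

    /** Union of all fragment boxes in one column with rowTop <= y < rowBottom; null for an empty cell. */
    static Rectangle2D cellBounds(List<Fragment> fragments, Rectangle2D column,
                                  double rowTop, double rowBottom) {
        Rectangle2D bounds = null;
        for (Fragment f : fragments) {
            boolean inColumn = column.contains(f.box.x, f.box.y);
            boolean inRow = f.box.y >= rowTop && f.box.y < rowBottom;
            if (inColumn && inRow) {
                bounds = (bounds == null) ? f.box.getBounds2D() : bounds.createUnion(f.box);
            }
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Made-up sample data: row 1 has a two-line description, row 2 has none.
        List<Fragment> fragments = Arrays.asList(
                new Fragment("1", 50, 100, 10, 10),
                new Fragment("First line", 100, 100, 120, 10),
                new Fragment("second line", 100, 112, 90, 10),
                new Fragment("2", 50, 130, 10, 10));

        Rectangle2D keyColumn = region(50, 100, 100.0, 160, 595);
        Rectangle2D descColumn = region(100, 100, null, 160, 595);

        List<Double> tops = rowTops(fragments, keyColumn);
        for (int i = 0; i < tops.size(); i++) {
            // A row ends where the next one starts, or at the bottom of the column.
            double rowBottom = (i + 1 < tops.size()) ? tops.get(i + 1) : descColumn.getMaxY();
            Rectangle2D cell = cellBounds(fragments, descColumn, tops.get(i), rowBottom);
            System.out.println("row " + i + ": " + cell);
        }
    }
}
```

With the sample data, the description cell of the first row covers both lines (x=100, y=100, width=120, height=22) and the cell of the second row is `null`, which is how a missing value shows up in this sketch. A rectangle built this way is a `java.awt.geom.Rectangle2D`, the type that `PDFTextStripperByArea.addRegion` accepts, though whether the coordinates line up depends on how the fragments were extracted.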