1

I have a PDF file that was created with JasperReports and produced with iText (I don't know whether that is relevant). Is there an API or tool that lets me inspect its structure? I need to extract text from it.

  • I tried iText, PDFBox and other Java libraries, but I only get the text line by line, and that's not what I need.
  • I also tried converting to HTML, XML and DOM, but I get the same result as with text extraction: no structure is parsed.
  • If I open it as DOCX, I see that Word recognizes some structure. For example, an area that looks like a table in the PDF is an actual table after conversion to DOCX.

I need to understand how the PDF was created, if that is possible. I know that working with PDFs is not easy, but I need something useful to start from. Thanks!
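
As a first check of whether the file carries any logical structure at all, here is a minimal sketch that scans the raw bytes for the `/StructTreeRoot` name, which a tagged PDF references from its document catalog. The class and method names are hypothetical, and this is only a crude heuristic: in PDF 1.5+ files the catalog can sit inside a compressed object stream (`/ObjStm`), in which case the name will not be visible in the raw bytes and a real PDF object inspector is needed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: a crude probe of raw PDF bytes, not a PDF parser. */
final class PdfStructureProbe {

    /** True if the raw bytes contain the given PDF name, e.g. "/StructTreeRoot". */
    static boolean mentions(byte[] pdfBytes, String pdfName) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is not mangled.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains(pdfName);
    }

    /** True if a structure tree reference is visible in the uncompressed part of the file. */
    static boolean mentionsStructTree(byte[] pdfBytes) {
        return mentions(pdfBytes, "/StructTreeRoot");
    }

    /** True if the file uses object streams, which can hide the catalog from this probe. */
    static boolean usesObjectStreams(byte[] pdfBytes) {
        return mentions(pdfBytes, "/ObjStm");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfStructureProbe.java <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("StructTreeRoot visible: " + mentionsStructTree(bytes));
        System.out.println("Object streams present: " + usesObjectStreams(bytes));
    }
}
```

If the probe finds no structure tree and the file has no object streams, the PDF is most likely untagged, and any table structure would have to be inferred from text positions.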

U880D
Tobi
  • Since you are not interested in the text / content line by line, can you provide more details on what kind of information or structure you are interested in? – U880D Jul 11 '18 at 08:23
  • A typical PDF does not store any more information than its plain text "line by line" – and even *that* is not a requirement nor a guarantee. (One or two characters at a time, at any x and y position, is not unusual at all.) You may get lucky with your limited workflow … but do your research and verify with a PDF object inspector if your workflow indeed *does* store this meta-information. If it doesn't, then no tool can help you. – Jongware Jul 11 '18 at 08:32
  • @U880D Each page has three blocks of text separated by bold horizontal lines. The first contains information I am not interested in, the second has some information in two columns, and the last is a sort of table with four columns and about 10 rows. That table is what I need: I want to extract it as a table structure and read the text from it. – Tobi Jul 11 '18 at 08:38
  • @U880D Another problem: I tried to identify what I need by coordinates (I saw that the information sits at the same coordinates in every PDF of this kind, and I wrote an algorithm for that), but if the producer of the PDF changes something (adding a new line, for example), my algorithm breaks... – Tobi Jul 11 '18 at 08:49
  • You may start with this thread about [Structure of a PDF file](https://stackoverflow.com/questions/88582/structure-of-a-pdf-file). – U880D Jul 11 '18 at 08:55
  • Can you share the file in question? – mkl Jul 11 '18 at 14:39
  • The Datalogics PDF Java Toolkit does an excellent job of inferring structure and extracting the text from PDF files that were created *without* structure into a List of Paragraph objects, which is a List of Sentence objects, composed of Word objects in reading order. You can use the Word objects to get the word coordinates as well. – joelgeraci Jul 11 '18 at 15:11
  • Who is generating the PDF from JasperReports/iText? If it is you, do you not already have the structure information you are looking for at the start? Essentially, it is unclear why you are trying to analyze a PDF, which is very difficult, if you already have the structure beforehand. – Ryan Jul 11 '18 at 18:26
  • @Ryan, it is not generated by me, I download it from my bank account. It's a transaction report document. I only have the PDF and nothing else.... – Tobi Jul 12 '18 at 14:48
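
The fixed-coordinate approach discussed in the comments above breaks as soon as the layout shifts. A less brittle variant is to group text fragments into rows by relative position and then locate the table by its header text. The following is a minimal sketch under stated assumptions: `Fragment` is a hypothetical container for whatever text-with-position objects the extraction library returns, `y` is assumed to grow down the page, and the tolerance value is something to tune per document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: rebuilds table rows from positioned text fragments. */
final class RowGrouper {

    /** Hypothetical container for one piece of text and its position on the page. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Groups fragments whose y differs from the row's first fragment by at most yTolerance. */
    static List<List<String>> groupRows(List<Fragment> fragments, double yTolerance) {
        // Sort top to bottom (assumes y grows down the page).
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        double rowY = 0;
        for (Fragment f : sorted) {
            // Start a new row when the fragment is too far from the current row's baseline.
            if (rows.isEmpty() || Math.abs(f.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        // Within each row, order cells left to right and keep only the text.
        List<List<String>> table = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            table.add(cells);
        }
        return table;
    }

    /** Returns the rows that follow the first row whose first cell equals headerCell. */
    static List<List<String>> rowsAfter(List<List<String>> rows, String headerCell) {
        for (int i = 0; i < rows.size(); i++) {
            List<String> row = rows.get(i);
            if (!row.isEmpty() && row.get(0).equals(headerCell)) {
                return new ArrayList<>(rows.subList(i + 1, rows.size()));
            }
        }
        return new ArrayList<>();
    }
}
```

Because the table is found by its header text rather than by absolute position, an extra line added above it moves every row but does not change the result. This still depends on the header wording staying stable and on cells not wrapping onto a second line.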

2 Answers

1

PDFTron PDFGenie can do full semantic table and paragraph extraction from a PDF file. It can generate a reflowable HTML file containing all the appropriate HTML tags for tables and paragraphs.

See this blog for more details. https://www.pdftron.com/blog/parsing-extraction/table-extraction-and-pdf-to-xml-with-pdfgenie/#a-idpart7aevaluating-accuracy-of-pdf-table-recognition

You can download the PDFGenie command-line tool for Windows, macOS and Linux here: https://www.pdftron.com/downloads/linux

Ryan
0

One more option: Aspose.PDF can also extract the text. If you want, have a look at the link below.

https://blog.aspose.com/2018/02/28/extract-text-by-paragraphs-and-convert-files-to-pdf-with-aspose.pdf/

srinivas
  • Thank you. I used Aspose.PDF to convert to DOCX (I forgot to mention it) because a DOCX helps me more than a PDF, but the conversion does not keep any structure. As I said, if I open it from Adobe Reader as DOCX, I get some useful formatting, but the conversion keeps none of it; there are just paragraphs. – Tobi Jul 11 '18 at 08:54