Extracting table structured text from PDF-file

Question

I'm extracting information from a PDF file into a string. When the text is laid out as a table in the PDF, the extracted text follows the order in which the reader encounters each line, not cell by cell within the table row.

After reading and searching for hours, I would like some tips on how I should approach this problem so that the string is structured as shown below.

PDF table structure (screenshot):

Current string:

Difenylmetandiisocyanat 9016-87-9 Acute Tox. 4; H332 >= 10 - < 20 
Skin Irrit. 2; H315 
Eye Irrit. 2; H319 
Resp. Sens. 1; H334 
Skin Sens. 1; H317 
Carc. 2; H351 
STOT SE 3; H335 
STOT RE 2; H373 
4,4'-metylendifenyldiisocyanat 101-68-8 Acute Tox. 4; H332 >= 10 - < 20 
202-966-0 Skin Irrit. 2; H315 
Eye Irrit. 2; H319 
Resp. Sens. 1; H334 
Skin Sens. 1; H317 
Carc. 2; H351 
STOT SE 3; H335 
STOT RE 2; H373

Desired structure:

Difenylmetandiisocyanat 

9016-87-9 

Acute Tox. 4; H332  
Skin Irrit. 2; H315 
Eye Irrit. 2; H319 
Resp. Sens. 1; H334 
Skin Sens. 1; H317 
Carc. 2; H351 
STOT SE 3; H335 
STOT RE 2; H373 

>= 10 - < 20 

4,4'-metylendifenyldiisocyanat 

101-68-8 
202-966-0

Acute Tox. 4; H332 
Skin Irrit. 2; H315 
Eye Irrit. 2; H319 
Resp. Sens. 1; H334 
Skin Sens. 1; H317 
Carc. 2; H351 
STOT SE 3; H335 
STOT RE 2; H373 

>= 10 - < 20

You are forgetting to give us the most important information: you talk about "table-structured text", but instead of sharing a PDF so that we can find out if the PDF is structured (in official language: to check if your PDF is properly *Tagged*), you share a screen shot. There is no way for us to check your allegation that you indeed have a table structure in the PDF. There is a huge difference between what the human eye perceives as a table structure and an actual table structure in a *Tagged PDF*. If the PDF isn't tagged, it's not structured. — Bruno Lowagie, Aug 12 '16 at 11:25
There are no tags in the file. PDF-file: [link](http://expirebox.com/files/d3426fda8d00dd0e7c6791814b5994c8.pdf) — Jonas Johansson, Aug 12 '16 at 11:35
Then the PDF isn't structured and you're asking something that isn't provided out of the box (not by any tool I know) and that requires a lot of programming work (more than can be provided on Stack Overflow). — Bruno Lowagie, Aug 12 '16 at 11:39
Then again: I look at your PDF, and it contains tags, so what are you talking about??? Why aren't you extracting the tagged structure? — Bruno Lowagie, Aug 12 '16 at 11:40

score 1 · Accepted Answer · answered Aug 12 '16 at 12:05

In your comment you say "There are no tags in the file". However, when I check the file, I clearly see the structure tree:

When a PDF is Tagged, you can easily convert it to XML:

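import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.TaggedPdfReaderTool;
import java.io.FileOutputStream;
// Dump the PDF's structure tree (its tags) to an XML file (iText 5 package names assumed)
TaggedPdfReaderTool converter = new TaggedPdfReaderTool();
converter.convertToXml(
    new PdfReader("resources/pdfs/sds_w_sv_3.pdf"),
    new FileOutputStream("results/sds_w_sv_3.xml"));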

This is a snippet of the resulting XML file:

<Table>
<TR>
<TH>
<Span></Span>
<P>
Best&#229;ndsdelar
 </P>
</TH>
<TH>
<Span></Span>
<P>
CAS
-
nr.
 </P>
</TH>
<TH>
<Span></Span>
<P>
Kontrollparametrar
 </P>
</TH>
<TH>
<Span></Span>
<P>
Grundval
 </P>
</TH>

This XML is an HTML-like structure that allows you to extract the table as a table. However, there must be something wrong with the way the PDF was tagged, because not all the information that is visible in the PDF is rendered to XML.
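To make "extract the table as a table" concrete, here is a minimal sketch that reads such XML with the JDK's DOM parser and collects the cell text row by row. It rests on several assumptions: the class name `TaggedTableExtractor` and the method `extractRows` are hypothetical helpers (not part of iText), the XML is assumed to be well-formed with a single root element, and data cells are assumed to be tagged `<TD>` (the snippet above only shows `<TH>`).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of iText.
class TaggedTableExtractor {

    /** Returns one list of cell texts for every <TR> element in the XML. */
    static List<List<String>> extractRows(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("TR");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (!(child instanceof Element)) {
                    continue; // skip whitespace-only text nodes between cells
                }
                String tag = ((Element) child).getTagName();
                // Assumption: header cells are <TH>, data cells are <TD>.
                if (tag.equals("TH") || tag.equals("TD")) {
                    // Join the text of nested <P>/<Span> and collapse line breaks.
                    cells.add(child.getTextContent().trim().replaceAll("\\s+", " "));
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline sample shaped like the snippet above.
        String xml = "<Table><TR>"
                + "<TH><Span></Span><P>\nCAS\n-\nnr.\n </P></TH>"
                + "<TH><Span></Span><P>\nGrundval\n </P></TH>"
                + "</TR></Table>";
        System.out.println(extractRows(xml));
    }
}
```

Each inner list corresponds to one table row, so a cell that spans several visual lines in the PDF (such as the hazard classifications in the question) would stay together in a single entry, provided the PDF tags it as one cell.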

You can see this when you click on one of the first tags:

The content of the first <P> (paragraph) in the structure tree is AVSNITT 1 on page 40. What happened to the tags of the first 39 pages? This is a bad PDF file. It says that it's tagged, but at first sight it isn't properly tagged. You should ask the person who produced this file to properly tag it. Without proper tags, you will have a hard time finding a table-like structure programmatically.

Thanks, this is really helpful. — Jonas Johansson, Aug 12 '16 at 12:17
