Read data from a PDF and export it to Excel in the same structure as the PDF

Question

I have to read data from a PDF and write it to Excel, using iTextSharp. I am able to read all the text from the PDF, but the problem is that I have to put that text into Excel in the same format as in the PDF (where it is laid out as a table), and I am getting the data only as a series of strings.

Please suggest a way to separate the values from the table column names; I am not able to distinguish which text is a value and which is a column name.

Below is the table structure in the PDF:

----------------------------------
|                                |
|Name|jaydeep|Age|25 |Place|India|
----------------------------------
|Sex |Male   |Pin|000|Job  |Yes  |
----------------------------------

After extraction I have all the text; now I have to populate Excel with this data in the same table structure:

---------------------------------------
|Table1                               |
---------------------------------------
|#|ActionPlan|Description|Failure Mode|
---------------------------------------
|1|Test      |Sample test|No          |
---------------------------------------
|2|Change R  |Sample 1   |No          |
---------------------------------------
|3|xxxxx     |Sample 2   |Yes         |
---------------------------------------

I have used some logic and am able to get the data into a string[] array in the format below:

BT /F3 9 Tf 1 1 1 rg 407.446 TL 297.648 364.176 Td (CCR Metrics) Tj T* ET

BT /F3 9 Tf 0.161 0.365 0.537 rg 407.446 TL 306.576 349.776 Td (#) Tj T* ET

BT /F3 9 Tf 0.161 0.365 0.537 rg 407.446 TL 375.912 349.776 Td (CCR) Tj T* ET

BT /F3 9 Tf 0.161 0.365 0.537 rg 407.446 TL 454.68 349.776 Td (Value) Tj T* ET

BT /F3 9 Tf 0.161 0.365 0.537 rg 407.446 TL 489.888 349.776 Td (Threshold) Tj T* ET

BT /F3 9 Tf 0.161 0.365 0.537 rg 407.446 TL 542.88 349.776 Td (Status) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 306.72 332.208 Td (1) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 332.208 Td (Program: ) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 322.704 Td (xxcxcxcx) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 313.2 Td (fdwdf44) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 303.696 Td (44dd) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 456.624 332.208 Td (981.80) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 505.872 332.208 Td (1152.00) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 306.72 290.16 Td (2) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 290.16 Td (Dataset: ) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 280.656 Td (P1924_w_V20) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 271.152 Td (ww55)-) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 261.648 Td (P978555520_JMC) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 456.624 290.16 Td (186.40) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 510.624 290.16 Td (512.00) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 306.72 248.112 Td (3) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 248.112 Td (RAM: ) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 238.608 Td (PddUPF_V20) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 229.104 Td (yurfcew345) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 219.6 Td (Pqsq0_JMC) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 461.376 248.112 Td (46.50) Tj T* ET

BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 515.376 248.112 Td (72.00) Tj T* ET

So the text in parentheses is the PDF data that I am getting from the table structure of the PDF.
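To make the format above concrete: in each string, the two numbers before `Td` are the text position operands (x, then y) and the parenthesised string before `Tj` is the text that is shown. The following is only a minimal sketch of pulling those three values out with a regular expression. It is written in Python for brevity (the same pattern should port to C#), the name `parse_fragment` is hypothetical, and it assumes every string holds exactly one `Td` and one `Tj`, that no other transformation moves the text, and that the text contains no escaped characters.

```python
import re

# Two numeric operands, the Td operator, then "(text) Tj".
# The greedy (.*) keeps stray parentheses inside the text, e.g. "(ww55)-)".
# Assumption: numbers always have a leading digit and the text has no escapes.
TD_TJ = re.compile(
    r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td\s+\((.*)\)\s+Tj"
)


def parse_fragment(line):
    """Return (x, y, text) for one BT ... ET string, or None if it does not match."""
    match = TD_TJ.search(line)
    if match is None:
        return None
    x, y, text = match.groups()
    return float(x), float(y), text


# Sample strings copied from the question.
lines = [
    "BT /F3 9 Tf 0.161 0.365 0.537 rg 407.446 TL 306.576 349.776 Td (#) Tj T* ET",
    "BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 332.208 Td (Program: ) Tj T* ET",
    "BT /F4 8.5 Tf 0 0 0 rg 384.81 TL 324.648 271.152 Td (ww55)-) Tj T* ET",
]

for line in lines:
    print(parse_fragment(line))
```

Fragments that share the same y value sit on the same text line, and fragments that share the same x value start at the same horizontal position, which is the information needed to rebuild rows and columns.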

Now I have to put this data into Excel in the same table structure.
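One possible way to approach this, sketched here under stated assumptions rather than as a finished solution: treat each piece of text as an (x, y, text) fragment, assign it to a column by comparing x against a list of column left edges, and start a new row whenever a fragment lands in the first column. The edge values in `COLUMN_EDGES` are hypothetical and were picked by eye from the sample coordinates above; the function names are also made up for illustration. The sketch is in Python and writes CSV (which Excel can open) using only the standard library; producing a real .xlsx file would need a spreadsheet library, which is not assumed here.

```python
import csv
import io

# Hypothetical left edges of the five columns, in PDF points,
# chosen by eye from the sample coordinates in the question.
COLUMN_EDGES = [290.0, 320.0, 440.0, 485.0, 530.0]


def column_index(x, edges=COLUMN_EDGES):
    """Return the index of the last column whose left edge is <= x."""
    index = 0
    for i, edge in enumerate(edges):
        if x >= edge:
            index = i
    return index


def build_rows(fragments, edges=COLUMN_EDGES):
    """Turn (x, y, text) fragments into a list of table rows."""
    rows = []
    # PDF y grows upward, so sort top to bottom, then left to right.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        col = column_index(x, edges)
        # Assumption: a fragment in the first column starts a new row;
        # lower lines in other columns are wrapped text of the current row.
        if col == 0 or not rows:
            rows.append([""] * len(edges))
        cell = rows[-1][col]
        rows[-1][col] = (cell + " " + text.strip()).strip()
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# A subset of the fragments from the question, as (x, y, text).
fragments = [
    (306.576, 349.776, "#"), (375.912, 349.776, "CCR"),
    (454.68, 349.776, "Value"), (489.888, 349.776, "Threshold"),
    (542.88, 349.776, "Status"),
    (306.72, 332.208, "1"), (324.648, 332.208, "Program: "),
    (324.648, 322.704, "xxcxcxcx"), (456.624, 332.208, "981.80"),
    (505.872, 332.208, "1152.00"),
    (306.72, 290.16, "2"), (324.648, 290.16, "Dataset: "),
    (324.648, 280.656, "P1924_w_V20"), (456.624, 290.16, "186.40"),
    (510.624, 290.16, "512.00"),
]

table = build_rows(fragments)
print(rows_to_csv(table))
```

This only works when the first column never wraps and the column edges are known in advance; a PDF with a different layout would need different edges or a clustering step on the x values.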

PDFs have no concept of tables, just lines that happen to be near text that happens to have patterns that you think look like tables. PDFs don't even have words or paragraphs. Because of this, iTextSharp doesn't have any direct helper methods for this. See this http://stackoverflow.com/a/2225739/231316 and this http://stackoverflow.com/a/22680222/231316. If your PDF is fairly uniform, however, you might be able to write your own text extraction strategy to recreate your table. - Chris Haas, Apr 11 '14 at 17:58
If the document happens to have a structure, you can try to access it and retrieve your information from there. If it does not have structure, you might check whether the tools for making the PDF accessible work sufficiently well, and then retrieve the structure. - Max Wyss, Apr 12 '14 at 11:10
Thanks @Max Wyss, I wrote an algorithm with which I am able to create a copy of my PDF that is not exact but is close to it. - Jaydeep Shil, Apr 15 '14 at 06:28
