How to read data from a PDF document and insert it into a database?

Question

I got a PDF document from the customer. The document is 60 pages long. I need to read the data from the middle of page 49 through page 58. In ColdFusion there is a cfpdf tag that allows reading PDF documents. Here is an example of what I have so far:

<cftry>
    <cfset mypdf = expandPath("./data.pdf")>
    <cfpdf action="read" source="#mypdf#" name="PDFInfo">

    <cfdump var="#PDFInfo#">

    <cfcatch type="any">
        <cfdump var="#cfcatch#">
    </cfcatch>
</cftry>

After the document is dumped to the screen, I see information like this:

Author  [empty string]
CenterWindowOnScreen    no
ChangingDocument    Allowed
Commenting  Allowed
ContentExtraction   Allowed
CopyContent     Allowed 
PageSizes   
PDFDocument - array
1   
PDFDocument - struct
height  792
width   612
2   
PDFDocument - struct
height  792
width   612
3   
PDFDocument - struct
height  792
width   612
4   
PDFDocument - struct
height  792
width   612

I have never used cfpdf before, so this is new to me. I searched the web but couldn't find an example of how to get the data out of a PDF document. Is there a good way to get the data from specific pages in the file? I also assume there has to be a loop that allows accessing individual row data. If anyone has a good example or resource for this problem, please let me know. Thanks.

Documentation is a good start. Look up CFPDF and extracttext — haxtbh, Dec 10 '18 at 13:54
Searching for `cfpdf example` in Google came back with a ton of results. Including: http://www.learncfinaweek.com/week1/cfpdf/ — Cory Fail, Dec 10 '18 at 15:30
@fyroc I did see that but please tell me where you see the example on how to scrape the data/loop over the data in PDF document? I was looking for something that would help with extracting the data from specific pages. — espresso_coffee, Dec 10 '18 at 15:33
Can you edit your post about what you're trying to do? Are you trying to read the page texts? — Cory Fail, Dec 10 '18 at 15:37
@fyroc I explained that the data from pages 49-58 needs to be inserted into the DB. However, the data seems to be in tables. I need to pull that data, loop over it, clean each column, and then insert. I'm only wondering about the PDF part: how to scrape the data. Inserting into the DB is easier and I already have that code. — espresso_coffee, Dec 10 '18 at 15:40
Look at `action="extracttext"` with the `addquads` attribute. Line 377 in the syntax example here: https://helpx.adobe.com/coldfusion/cfml-reference/coldfusion-tags/tags-p-q/cfpdf.html — Cory Fail, Dec 10 '18 at 15:42
@fyroc Is there a way to get the page in HTML format from PDF document? — espresso_coffee, Dec 10 '18 at 15:59
It's not currently possible with CF. Here is a good starting point. https://stackoverflow.com/questions/16785198/use-pdf-js-to-statically-convert-a-pdf-to-html?rq=1 — Cory Fail, Dec 10 '18 at 16:10
@espresso_coffee - It is a pdf file, so there is no HTML. You can only extract the text. — SOS, Dec 11 '18 at 22:20
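
For reference, the `extracttext` suggestion from the comments could look roughly like the sketch below. It is untested against the customer's file and rests on several assumptions: that `pages="49-58"` selects the range, that `type="string"` returns plain text in the `name` variable, and that table columns come out separated by runs of two or more spaces. The variable names `pdfText`, `pdfLines`, `pdfLine`, and `colArray` are made up for illustration. Since the data starts in the middle of page 49, the lines before the table would still need to be skipped, for example by looking for a known header row.

```cfml
<cftry>
    <cfset mypdf = expandPath("./data.pdf")>

    <!--- Pull only pages 49-58 as plain text (assumes type="string" yields a simple string). --->
    <cfpdf action="extracttext"
           source="#mypdf#"
           pages="49-58"
           type="string"
           name="pdfText">

    <!--- Split the extracted text into lines; CR and LF are both treated as delimiters. --->
    <cfset pdfLines = listToArray(pdfText, chr(13) & chr(10))>

    <cfloop array="#pdfLines#" index="pdfLine">
        <cfset pdfLine = trim(pdfLine)>
        <cfif len(pdfLine)>
            <!--- Assumption: columns are separated by 2+ whitespace characters.
                  Collapse each run into a tab, then split on tabs. --->
            <cfset colArray = listToArray(reReplace(pdfLine, "\s{2,}", chr(9), "all"), chr(9))>

            <!--- Inspect the row here; clean each column and run the existing insert code instead. --->
            <cfdump var="#colArray#">
        </cfif>
    </cfloop>

    <cfcatch type="any">
        <cfdump var="#cfcatch#">
    </cfcatch>
</cftry>
```

Dumping `pdfText` once before writing the loop shows how the table rows actually come out of this particular PDF. If the columns are not separated cleanly, the split rule above will need adjusting, or the `type="xml"` output with `addquads` mentioned in the comments may be worth a look.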

0 Answers