Extract table data from PDF to Excel/CSV with Amazon Textract

Question

I'm trying to extract tables from PDF files into an Excel file using Amazon Textract. I initially thought this would be simple, because it was while I was working with the Java SDK, but now I'm stuck. I don't want to use Lambda, and I don't want to upload the files to an S3 bucket.

What I need and tried: extracting entire tables from multiple PDF files into Excel.

I don't want to read the PDF into a text file and then write logic to fill the Excel file; I can already do that in pure C#.

This is not about extracting data as key-value pairs. I have already tried that: Key-Value Pair demo. With it, I'm able to get data from images and PDFs in key-value format. But after going through a lot of documentation, I learned that AnalyzeDocumentRequest works only with images and single-page PDFs, not with PDFs containing multiple pages.

I also tried StartDocumentTextDetection, but it seems to require an S3 bucket as a parameter, along with SNS, SQS, etc. Please correct me if I'm wrong.

So, where I'm stuck:

I have found lots of solutions on Google in Python and Java, such as:

Export all table data from PDF to Excel using Amazon textract

Amazon Textract without using Amazon S3

How to use the Amazon Textract with PDF files - again Python, and it introduced me to something called boto, which I'm not sure about. Lol!
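
To make the goal concrete: as far as I can tell, the Python approaches boil down to walking the TABLE, CELL and WORD blocks of the Textract response and writing them out row by row. Below is a minimal sketch of that idea in my own words, not code taken from those answers. It works on the response as a plain dictionary, so it makes no Textract call itself. The helper names (`child_ids`, `tables_to_rows`, `rows_to_csv`) and the `SAMPLE_RESPONSE` data are made up for illustration; the field names (`Blocks`, `BlockType`, `Relationships`, `RowIndex`, `ColumnIndex`, `Text`) are the ones Textract uses for its block objects. Merged cells and selection elements are ignored here.

```python
import csv
import io


def child_ids(block):
    """Return the IDs listed under a block's CHILD relationships."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def tables_to_rows(response):
    """Turn every TABLE block in a Textract-style response into a list of rows."""
    blocks = {b["Id"]: b for b in response["Blocks"]}
    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            # A cell's text is the concatenation of its child WORD blocks.
            words = [blocks[i]["Text"] for i in child_ids(cell)
                     if blocks[i]["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        # RowIndex and ColumnIndex are 1-based; missing cells become "".
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


def rows_to_csv(rows):
    """Serialize one table (a list of rows) as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def _word(word_id, text):
    return {"Id": word_id, "BlockType": "WORD", "Text": text}


def _cell(cell_id, row, col, word_ids):
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row,
            "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


# Hand-written, hypothetical response fragment (not real Textract output).
SAMPLE_RESPONSE = {"Blocks": [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, ["w1"]), _cell("c2", 1, 2, ["w2"]),
    _cell("c3", 2, 1, ["w3", "w4"]), _cell("c4", 2, 2, ["w5"]),
    _word("w1", "Name"), _word("w2", "Qty"),
    _word("w3", "Widget"), _word("w4", "A"), _word("w5", "3"),
]}

if __name__ == "__main__":
    for table in tables_to_rows(SAMPLE_RESPONSE):
        print(rows_to_csv(table))
```

The equivalent of this block-walking logic is what I'm looking for in C#.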

I want to implement this in C# / .NET, and I can't find proper documentation for it.
Obviously, I have gone through this, but that's not what I want.
It's not strictly necessary, but a solution that doesn't use an S3 bucket would be even better.

It would be really great if anyone can help me with this. Thanks in advance!

Haha you're right @KJ. My bad. Corrected it. What I meant was I know how to read/extract data from pdf. But with textract I'm finding it difficult. — StackUseR, Mar 24 '23 at 18:07
