How to export a table to CSV from a document (PDF/image) using AWS Textract and .NET

Question

I am trying to extract tables and data from a PDF file using DetectDocument (asynchronous) from the AWS Textract service with C#/.NET.

The data extraction works, but I cannot figure out how to extract the tables in a PDF with AnalyzeDocument and export them to a CSV file.

I read the AWS documentation and found a CSV export example in Python, but none in .NET. See: https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html

I tried to replicate the Python code in .NET, but was not successful.

Can you share the reference URL where you found the Textract integration code for .NET? I am not able to find it on Google. — Varinder, Feb 17 '20 at 09:47
Any solution? I'm trying to do the same thing, porting from Python to Node. — nab, Feb 29 '20 at 14:51
@Varinder - I read the documentation and did the integration myself, but you can check this link, which should make your job a little easier - https://github.com/aws-samples/amazon-textract-code-samples/blob/master/src-csharp/Services/TextractTextDetectionService.cs — Pranav Harshe, Mar 01 '20 at 15:39
@ChiKaLiO - I was able to get the Python sample code here - https://github.com/aws-samples/amazon-textract-code-samples/tree/master/python — Pranav Harshe, Mar 01 '20 at 15:42
@ChiKaLiO, you can also refer to https://docs.aws.amazon.com/textract/latest/dg/textract-dg.pdf#examples-blocks All code examples there are in Python and Java. — Varinder, Mar 02 '20 at 13:40

score 1 · Accepted Answer · answered Jul 28 '20 at 07:23

You can use the following code, which loops through the relationships of a table block returned by Textract's GetDocumentAnalysis() and collects all the child cells linked to it, grouped into rows.

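// Assumption: these two locals are not shown in the original snippet; they are
// declared before this point, as in the linked Table.cs sample.
var ri = 1;           // current row index (Textract row indices start at 1)
var row = new Row();  // row currently being filled

var relationships = block.Relationships;
if (relationships != null && relationships.Count > 0) {
    relationships.ForEach(r => {
        // CHILD relationships of a TABLE block list the IDs of its CELL blocks.
        if (r.Type == "CHILD") {
            r.Ids.ForEach(id => {
                var cell = new Cell(blocks.Find(b => b.Id == id), blocks);
                // A higher row index means the previous row is complete.
                if (cell.RowIndex > ri) {
                    this.Rows.Add(row);
                    row = new Row();
                    ri = cell.RowIndex;
                }
                row.Cells.Add(cell);
            });
            // Add the last row, which the loop above never flushes.
            if (row != null && row.Cells.Count > 0)
                this.Rows.Add(row);
        }
    });
}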

For reference, the full code is available at the link below:

https://github.com/aws-samples/amazon-textract-code-samples/blob/master/src-csharp/TextractExtensions/Table.cs
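
The snippet above groups cells into rows but stops short of writing a CSV file. The following is a minimal sketch of that last step, not the code from the linked sample. To keep it runnable without the AWS SDK, it uses a hypothetical SimpleBlock class that mirrors only the Block fields it needs (Id, BlockType, RowIndex, ColumnIndex, Text, and the IDs from the CHILD relationship). The TableCsv class and its method names are likewise assumptions made for illustration. It places each cell by its row and column index, joins the text of the cell's WORD children with spaces, and quotes fields in the RFC 4180 style.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for Amazon.Textract.Model.Block, reduced to the
// fields this sketch needs so that it runs without the AWS SDK.
public sealed class SimpleBlock
{
    public string Id { get; set; } = "";
    public string BlockType { get; set; } = "";  // "TABLE", "CELL", "WORD", ...
    public int RowIndex { get; set; }            // 1-based, CELL blocks only
    public int ColumnIndex { get; set; }         // 1-based, CELL blocks only
    public string Text { get; set; } = "";       // WORD blocks only
    public List<string> ChildIds { get; set; } = new List<string>();  // CHILD relationship IDs
}

public static class TableCsv
{
    // Builds a row-major grid of cell text for one TABLE block.
    public static List<List<string>> ToGrid(SimpleBlock table, IReadOnlyList<SimpleBlock> blocks)
    {
        var byId = blocks.ToDictionary(b => b.Id);

        // Resolve the table's children and keep only well-formed CELL blocks.
        var cells = table.ChildIds
            .Where(id => byId.ContainsKey(id))
            .Select(id => byId[id])
            .Where(b => b.BlockType == "CELL" && b.RowIndex >= 1 && b.ColumnIndex >= 1)
            .ToList();

        int rowCount = cells.Count == 0 ? 0 : cells.Max(c => c.RowIndex);
        int colCount = cells.Count == 0 ? 0 : cells.Max(c => c.ColumnIndex);

        // Pre-fill with empty strings so missing cells still produce a column.
        var grid = new List<List<string>>();
        for (int r = 0; r < rowCount; r++)
            grid.Add(Enumerable.Repeat("", colCount).ToList());

        foreach (var cell in cells)
        {
            // A cell's text is the space-joined text of its WORD children.
            var words = cell.ChildIds
                .Where(id => byId.ContainsKey(id))
                .Select(id => byId[id])
                .Where(b => b.BlockType == "WORD")
                .Select(b => b.Text);
            grid[cell.RowIndex - 1][cell.ColumnIndex - 1] = string.Join(" ", words);
        }
        return grid;
    }

    // Quotes a field if it contains a comma, a double quote, or a line break,
    // and doubles any embedded double quotes.
    public static string Escape(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Joins the grid into CSV text, one line per row.
    public static string ToCsv(IEnumerable<IEnumerable<string>> rows)
    {
        return string.Join("\r\n", rows.Select(r => string.Join(",", r.Select(Escape))));
    }
}
```

With the real SDK types, the same approach would read these values from each Block and its CHILD relationship, and the result of TableCsv.ToCsv(grid) could be saved with System.IO.File.WriteAllText. The sketch ignores merged cells (row and column spans) and any non-WORD children such as selection elements, so tables that use those need extra handling.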

Hi Pranav. It would be really great if you could help me with this. I have gone through the GitHub link you provided, but couldn't make much of it. I have posted a similar question - https://stackoverflow.com/questions/75833473/extract-data-from-pdf-in-table-format-to-excel-csv-amazon-textract — StackUseR, Mar 27 '23 at 03:53
@StackUseR check if this link is helpful - https://docs.aws.amazon.com/pdfs/textract/latest/dg/textract-dg.pdf#examples-blocks — Pranav Harshe, Apr 03 '23 at 05:59
