I am trying to extract tables and other data from a PDF file with the AWS Textract service from C#/.NET, using asynchronous document text detection.

I was able to extract the text, but I cannot figure out how to extract the tables in the PDF and export them to a CSV file using AnalyzeDocument.

I read the AWS documentation and found a CSV export example for Python, but none for .NET: https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html

I tried to replicate the Python code in .NET but was not successful.

Pranav Harshe
  • Can you share the reference URL where you found the Textract integration code for .NET? I am not able to find it on Google. – Varinder Feb 17 '20 at 09:47
  • Any solution? I'm trying to do the same thing, porting from Python to Node. – nab Feb 29 '20 at 14:51
  • @Varinder - I read the documentation and did the integration myself, but this link should make your job a little easier: https://github.com/aws-samples/amazon-textract-code-samples/blob/master/src-csharp/Services/TextractTextDetectionService.cs – Pranav Harshe Mar 01 '20 at 15:39
  • @ChiKaLiO - I found the Python sample code here: https://github.com/aws-samples/amazon-textract-code-samples/tree/master/python – Pranav Harshe Mar 01 '20 at 15:42
  • @ChiKaLiO, you can also refer to https://docs.aws.amazon.com/textract/latest/dg/textract-dg.pdf#examples-blocks. All code examples there are in Python and Java. – Varinder Mar 02 '20 at 13:40

1 Answer

We can use this piece of code, which loops through the relationships of a TABLE block returned by Textract's GetDocumentAnalysis and collects all the child cells linked to it, grouping them into rows.

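// This fragment runs inside a Table class that has a Rows list;
// block is the TABLE block and blocks is the full list of returned blocks.
// The fragment needs these two locals: ri is the row index currently
// being filled, and row accumulates the cells of that row.
var ri = 1;
var row = new Row();

var relationships = block.Relationships;
if (relationships != null && relationships.Count > 0) {
    relationships.ForEach(r => {
        // CHILD relationships of a TABLE block point at its CELL blocks.
        if (r.Type == "CHILD") {
            r.Ids.ForEach(id => {
                var cell = new Cell(blocks.Find(b => b.Id == id), blocks);
                // A higher row index means the previous row is complete.
                if (cell.RowIndex > ri) {
                    this.Rows.Add(row);
                    row = new Row();
                    ri = cell.RowIndex;
                }
                row.Cells.Add(cell);
            });
            // Flush the last row, which no later cell would trigger.
            if (row != null && row.Cells.Count > 0)
                this.Rows.Add(row);
        }
    });
}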

For reference, the full code is at the link below:

https://github.com/aws-samples/amazon-textract-code-samples/blob/master/src-csharp/TextractExtensions/Table.cs
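
Once the cells are grouped by row, writing CSV is mostly a matter of joining the cell text and escaping it. Below is a minimal sketch of that step, not the code from the linked sample. To keep it runnable without the AWS SDK, it uses a hypothetical `SimpleBlock` class as a stand-in for `Amazon.Textract.Model.Block`; with the real SDK you would read `ChildIds` from the block's `Relationships` entries of type "CHILD". The names `SimpleBlock`, `TableCsv`, `Escape`, `CellText` and `ToCsv` are all assumptions made for illustration. The sketch also assumes every cell occupies exactly one row and one column, so merged cells and empty positions are not padded.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for Amazon.Textract.Model.Block, reduced to the
// fields this sketch needs so that it runs without the AWS SDK.
public sealed class SimpleBlock
{
    public string Id { get; set; } = "";
    public string BlockType { get; set; } = "";   // "TABLE", "CELL", "WORD", ...
    public string Text { get; set; } = "";
    public int RowIndex { get; set; }
    public int ColumnIndex { get; set; }
    public List<string> ChildIds { get; set; } = new List<string>();
}

public static class TableCsv
{
    // Quote a field when it contains a comma, a double quote, or a line break.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Join the text of a cell's WORD children with single spaces.
    public static string CellText(SimpleBlock cell, IReadOnlyDictionary<string, SimpleBlock> byId)
    {
        var words = cell.ChildIds
            .Where(id => byId.ContainsKey(id))
            .Select(id => byId[id])
            .Where(b => b.BlockType == "WORD")
            .Select(b => b.Text);
        return string.Join(" ", words);
    }

    // Render one TABLE block as CSV text, one line per table row.
    public static string ToCsv(SimpleBlock table, IEnumerable<SimpleBlock> blocks)
    {
        var byId = blocks.ToDictionary(b => b.Id);
        var lines = table.ChildIds
            .Where(id => byId.ContainsKey(id))
            .Select(id => byId[id])
            .Where(b => b.BlockType == "CELL")
            .GroupBy(c => c.RowIndex)          // one group per table row
            .OrderBy(g => g.Key)
            .Select(g => string.Join(",",
                g.OrderBy(c => c.ColumnIndex)  // left-to-right within the row
                 .Select(c => Escape(CellText(c, byId)))));
        return string.Join(Environment.NewLine, lines);
    }
}
```

The resulting string can then be saved with `System.IO.File.WriteAllText("table.csv", TableCsv.ToCsv(tableBlock, blocks))`, calling `ToCsv` once per TABLE block. Sorting by `RowIndex` and `ColumnIndex` rather than relying on the order of the child IDs is a deliberate choice here, so the output does not depend on the order in which the cells are listed.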

Pranav Harshe
  • Hi Pranav. It would be really great if you could help me with this. I have gone through the GitHub link you provided but couldn't make much of it. I have posted a similar question: https://stackoverflow.com/questions/75833473/extract-data-from-pdf-in-table-format-to-excel-csv-amazon-textract – StackUseR Mar 27 '23 at 03:53
  • @StackUseR check whether this link is helpful: https://docs.aws.amazon.com/pdfs/textract/latest/dg/textract-dg.pdf#examples-blocks – Pranav Harshe Apr 03 '23 at 05:59
  • Hi Pranav. I have gone through this document, but it was not helpful. – StackUseR Apr 03 '23 at 08:21