I have a custom table with name, first name, place of birth and place of residence in a PDF file, which I want to parse in C#. One of the simplest ways of doing it would be:

```csharp
using (PdfLoadedDocument document = new PdfLoadedDocument("foobar"))
{
    for (var i = 0; i < document.Pages.Count; i++)
    {
        Console.WriteLine($"============ PAGE NO. {i+1} ============");
        Console.WriteLine(document.Pages[i].ExtractText());
    }
}
```

But the problem is the output:

```
============ PAGE NO. 38 ============
John L.SmithSan Francisco5400 Baden
```

There's no way I can separate this with a regex, so I need a way to go through each column of each row and get the values for each customer separately. How can I parse a table in a PDF file with Syncfusion?

SuffPanda
  • Have you tried using `...ExtractText(true)`? – DavidG Jan 25 '17 at 14:22
  • @DavidG sadly, `ExtractText()` doesn't take a param – SuffPanda Jan 25 '17 at 14:24
  • Are you sure? The [docs](http://help.syncfusion.com/cr/cref_files/wpf/pdf/Syncfusion.Pdf.Base~Syncfusion.Pdf.PdfPageBase~ExtractText(Boolean).html) say otherwise. – DavidG Jan 25 '17 at 14:25
  • @DavidG you're right. I used the wrong version of SyncFusion. I updated it and tried with param `true` but no difference – SuffPanda Jan 25 '17 at 14:40
  • Possible duplicate of [How to read table from PDF using itextsharp?](https://stackoverflow.com/questions/15679958/how-to-read-table-from-pdf-using-itextsharp) – bubi Jul 04 '17 at 12:21

2 Answers

You will need a method that returns the coordinates of each character found in the PDF. Then you have some math to do (basically computing the distance between characters) to work out whether a character is part of a word and where the word itself is located along the x-axis. It requires quite a lot of work, and I didn't find such a method in the Syncfusion documentation.

I wrote a class that does what you want, but it is for Java projects: PDFLayoutTextStripper (built on PDFBox).
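For a C# project, the distance computation described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the `Glyph` class and `TableSplitter.ToRows` are hypothetical names, not part of Syncfusion or PDFBox, and the sketch assumes you already have each character's position and width from whatever library you use, with Y growing downward from the top of the page. It is not how PDFLayoutTextStripper is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical type: one character plus its position on the page.
public sealed class Glyph
{
    public char Ch { get; }
    public double X { get; }      // left edge of the character
    public double Y { get; }      // baseline; assumed to grow downward
    public double Width { get; }

    public Glyph(char ch, double x, double y, double width)
    {
        Ch = ch; X = x; Y = y; Width = width;
    }
}

public static class TableSplitter
{
    // Groups glyphs into rows by Y, then splits each row into cells
    // wherever the horizontal gap between neighbours exceeds columnGap.
    // A gap larger than wordGap (but not columnGap) becomes a space.
    public static List<List<string>> ToRows(
        IEnumerable<Glyph> glyphs, double rowTolerance, double wordGap, double columnGap)
    {
        var rows = new List<List<Glyph>>();
        foreach (var glyph in glyphs.OrderBy(c => c.Y))
        {
            // Same row if Y is within rowTolerance of the row's first glyph.
            var row = rows.LastOrDefault();
            if (row == null || Math.Abs(row[0].Y - glyph.Y) > rowTolerance)
            {
                row = new List<Glyph>();
                rows.Add(row);
            }
            row.Add(glyph);
        }

        var result = new List<List<string>>();
        foreach (var row in rows)
        {
            var cells = new List<string>();
            var cell = new StringBuilder();
            double? prevRight = null;   // right edge of the previous glyph
            foreach (var glyph in row.OrderBy(c => c.X))
            {
                if (prevRight.HasValue)
                {
                    double gap = glyph.X - prevRight.Value;
                    if (gap > columnGap)
                    {
                        // Large gap: close the current cell, start a new one.
                        cells.Add(cell.ToString());
                        cell.Clear();
                    }
                    else if (gap > wordGap)
                    {
                        // Medium gap: word break inside the same cell.
                        cell.Append(' ');
                    }
                }
                cell.Append(glyph.Ch);
                prevRight = glyph.X + glyph.Width;
            }
            cells.Add(cell.ToString());
            result.Add(cells);
        }
        return result;
    }
}
```

The three thresholds depend on the font size and layout of the document, so they have to be tuned per PDF. A fixed `columnGap` also fails when two adjacent cells happen to sit close together; deriving column boundaries from the header row's X positions is one possible refinement.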

jlink

The Syncfusion control extracts text from a PDF document based on the structure of the content in the document. So, with the current implementation of the Syncfusion control, we cannot recognize the rows and columns of a table in the PDF document.

Also, it is not possible with the Syncfusion control to extract the text in the same order as it is displayed in the PDF document, since the content of a PDF document follows a fixed layout.

However, the table from the PDF document can be populated into Excel using Tabula (an open-source library). I have modified tabula-java to achieve layout-based text extraction from the PDF document, based on your requirement.

Please find the sample for this implementation at the link below:

http://www.syncfusion.com/downloads/support/directtrac/171585/ze/TextExtractionSample649531336

Kindly ensure the following before executing the sample:

  1. Install the Java Runtime Environment (JRE) from the link below.
    http://www.oracle.com/technetwork/java/javase/downloads/
  2. Restart your machine.
  3. Execute the above sample.

Try this and check whether it meets your requirement.