Read specific value based on label name from PDF in C#

Question

I have an ASP.NET Core 2.0 C# application which reads and parses a PDF file and gets its text. From that text I want to read a specific value that belongs to a specific label name. In the image below, I want to get the value 171857, which is the invoice number, and store it in a database.

I have tried the code below to read the PDF using iTextSharp.

// Requires: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
using (PdfReader reader = new PdfReader(fileName))
{
    StringBuilder sb = new StringBuilder();

    // PDF page numbers are 1-based.
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // Use a fresh strategy per page: SimpleTextExtractionStrategy accumulates
        // text, so a shared instance would repeat the text of earlier pages.
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
        if (!string.IsNullOrWhiteSpace(text))
        {
            // The result is already a .NET string; no encoding round trip is needed.
            sb.Append(text);
        }
    }

    var pdfText = sb.ToString();
}

In the pdfText variable I get all the text content from the PDF, but this does not seem to be the proper way to get the invoice number. Is there any other way to read specific content from a PDF by its label name, for example by providing the label name Invoice and getting back the value 171857, possibly with other 3rd party pdf reader libraries?

Any help or suggestions would be highly appreciated.

Thanks

First of all, "Is there any other way to ... with other 3rd party pdf reader libraries?" clearly is a request for a library recommendation, which (meanwhile) is off-topic on Stack Overflow (there is a Software Recommendations Stack Exchange site for that). But even ignoring that part, you tagged your question both [tag:itext] and [tag:pdfsharp]; essentially you should decide which library you want to use, make a serious attempt to do so yourself, and if it doesn't work, ask a question *specific to your chosen library*. — mkl, May 16 '19 at 09:52
That being said, it is quite likely that your "label" and its "value" in the PDF merely are texts which happen to be drawn quite near to each other. Either one might be a form field value, or an arbitrary annotation, or part of the page content (directly or indirectly); furthermore, either one might be drawn as a bitmap image, or a vector image, or using text drawing instructions with or without sufficient information for text extraction. Thus, please clarify their natures, as an extraction approach depends thereon. — mkl, May 16 '19 at 09:58
@Ranadheer As explained in the comments above, the question is somewhat unclear and requires some clarifications. The OP failed to clarify, but probably you as bounty opener can. In particular, explain the nature of the label and values, or present a representative example PDF here. — mkl, Sep 18 '19 at 18:36
In my job, we used the OCR of the Google Cloud Vision API. The PDF is transformed into a string. Then we find the pattern. In your case, I would look for a number between the keywords "Invoice" and "Date". You can analyze your real text and find a better pattern. — heringer, Sep 25 '19 at 12:13

Maytham Fahmi · Answer 1 · 2019-09-25T10:08:28.757

I have helped a friend extract a similar value from a PDF invoice generated by Excel. For this answer I created an Excel invoice, printed it as a PDF file, and zipped it for download for testing purposes.

Next, I use a free, open-source library called PDFClown. It is available as a NuGet package.

What I do is scan the whole PDF document (an invoice can be one page or multiple pages) and add each piece of content to a list of strings.

The next step is to find the index of the entry that marks the invoice value, which I will call the tag or label. The invoice number itself could be anywhere in the list, for example the 10th element; in our case it is at index 1.

Since I do not have your PDF file, I improvised and added a unique tag called "INVOICE" (any other name would do). The invoice number in this case comes right after the invoice tag, so I find the index of the "INVOICE" tag and add 1 to it. This way I pick up the invoice text, 0005 in this case, and return it as the value 5. In the same way you can fetch whatever text or value follows any tag in the scanned list and return it in the form you need.

So you need to play with it a bit to fit it 100% to your PDF file.

Here are my test files (Excel and PDF), zipped. Download them for your own testing.

Here is the code:

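// Namespaces as used by PDF Clown for .NET.
using System;
using System.Collections.Generic;
using org.pdfclown.documents.contents;
using org.pdfclown.documents.contents.objects;
using org.pdfclown.files;

public class InvoiceTextExtraction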
{
    private List<string> _contentList;

    public void GetValueFromPdf()
    {
        _contentList = new List<string>();
        CreatePdfContent(@"C:\temp\Invoice1.pdf");

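        // FindIndex returns -1 when the tag is missing, so check before adding 1.
        var tagIndex = _contentList.FindIndex(e => e == "INVOICE");
        if (tagIndex < 0 || tagIndex + 1 >= _contentList.Count)
        {
            Console.WriteLine("INVOICE tag or its value was not found.");
            return;
        }

        // The invoice number is the entry right after the tag; "0005" parses to 5.
        if (int.TryParse(_contentList[tagIndex + 1], out var value))
            Console.WriteLine(value);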
    }


    public void CreatePdfContent(string filePath)
    {
        using (var file = new File(filePath))
        {
            var document = file.Document;

            foreach (var page in document.Pages)
            {
                Extract(new ContentScanner(page));
            }
        }
    }

    private void Extract(ContentScanner level)
    {
        if (level == null)
            return;

        while (level.MoveNext())
        {
            var content = level.Current;
            switch (content)
            {
                case ShowText text:
                {
                    var font = level.State.Font;
                    _contentList.Add(font.Decode(text.Text));
                    break;
                }
                case Text _:
                case ContainerObject _:
                    Extract(level.ChildLevel);
                    break;
            }
        }
    }
}

Input extracted from the PDF file. The scan returns the following elements:

INVOICE
0005

PAYMENT DUE BY:
4/19/2019
.etc
.
.
.
Tax
USD TOTAL
171857
18 september 2019

The result printed for this input is 5.

The code is inspired by this link.
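The "find the tag, take the next entry" step can be generalized to any label. The following is only a minimal sketch, assuming the extracted strings arrive in the same order as in the list above; `LabelLookup` and `GetValueAfterLabel` are hypothetical names, not part of PDFClown.

```csharp
using System;
using System.Collections.Generic;

public static class LabelLookup
{
    // Returns the entry that directly follows the first entry equal to `label`
    // (case-insensitive, ignoring surrounding whitespace), or null when the
    // label is missing or is the last entry in the list.
    public static string GetValueAfterLabel(IReadOnlyList<string> content, string label)
    {
        for (var i = 0; i < content.Count - 1; i++)
        {
            if (string.Equals(content[i]?.Trim(), label, StringComparison.OrdinalIgnoreCase))
            {
                return content[i + 1]?.Trim();
            }
        }

        return null;
    }
}
```

Calling `LabelLookup.GetValueAfterLabel(_contentList, "INVOICE")` on the list shown above would give back the raw text `0005`, which the caller can then parse or store as needed. Returning the raw string rather than an int keeps leading zeros and lets the same helper serve labels such as `PAYMENT DUE BY:`.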

what if I just wanted to search for a string and return a boolean if it was found? please see my post: https://stackoverflow.com/questions/66548502/file-content-search-c-sharp — StackUseR, Mar 12 '21 at 06:42

score 4 · Answer 2 · answered Sep 18 '19 at 13:06

This assumes that the invoice label and invoice number are embedded as text in the PDF and not as a bitmap.

One way that I can think of doing this is to use Spire.PDF to extract the location of the label, and then find the number written right below that location. This will be relatively simple if all the PDFs you want to process share the same template.
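The location-based idea can be sketched independently of any particular library. The example below does not use the Spire.PDF API; it assumes you have already obtained text fragments with page coordinates from whichever library you choose. `TextChunk`, `PositionLookup`, `FindValueBelow`, and the tolerance value are all hypothetical and only illustrate the matching logic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one extracted text fragment and its position.
// X is the left edge and Y the top edge, with Y growing downwards on the page.
public sealed class TextChunk
{
    public TextChunk(string text, double x, double y)
    {
        Text = text;
        X = x;
        Y = y;
    }

    public string Text { get; }
    public double X { get; }
    public double Y { get; }
}

public static class PositionLookup
{
    // Returns the text of the nearest chunk below the label whose left edge is
    // within `xTolerance` of the label's left edge, or null if none is found.
    public static string FindValueBelow(IEnumerable<TextChunk> chunks, string label, double xTolerance = 20.0)
    {
        var list = chunks.ToList();
        var anchor = list.FirstOrDefault(
            c => string.Equals(c.Text.Trim(), label, StringComparison.OrdinalIgnoreCase));
        if (anchor == null)
        {
            return null;
        }

        return list
            .Where(c => c.Y > anchor.Y && Math.Abs(c.X - anchor.X) <= xTolerance)
            .OrderBy(c => c.Y)
            .Select(c => c.Text.Trim())
            .FirstOrDefault();
    }
}
```

If your library reports coordinates with the origin at the bottom-left (as raw PDF coordinates do), the comparison `c.Y > anchor.Y` and the ordering would need to be reversed.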

score 0 · Answer 3 · answered Sep 18 '19 at 14:15

It isn't immediately clear from the question whether pdfText will contain the invoice number along with the rest of the text, but I'll assume it does. If it doesn't, then you will need OCR, which is a different beast entirely.

My first instinct would be to build a regex (^\d{6}$ in this case, applied line by line or with RegexOptions.Multiline so that the anchors match at line boundaries) and try to apply it to all the text on the page. If there is only one match (the invoice #), then great! Otherwise, if it matches more things, you could find all occurrences and look for a pattern. For example, if customers had an ID that also matched that regex, you could extract all lines which contain a matching number, and discard all lines that contain some other info (maybe all lines with a customer # would also have a date in a specific format, for instance). Basically, find all occurrences where the regex could match, and try to find rules to exclude all the occurrences you don't care about.
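A minimal sketch of that first step, assuming the invoice number is exactly six digits on a line of its own in the extracted text. `InvoiceRegex` and its method names are hypothetical; only `System.Text.RegularExpressions` is used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class InvoiceRegex
{
    // Six digits alone on a line. Multiline makes ^ and $ match at line
    // boundaries; the optional \r tolerates Windows (\r\n) line endings.
    private static readonly Regex SixDigitLine =
        new Regex(@"^(\d{6})\r?$", RegexOptions.Multiline);

    // Returns every six-digit line found in the extracted text.
    public static IReadOnlyList<string> FindCandidates(string pdfText)
    {
        return SixDigitLine.Matches(pdfText)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }

    // Returns the number only when exactly one candidate exists; otherwise null,
    // which signals that extra exclusion rules are needed.
    public static string TryGetSingleCandidate(string pdfText)
    {
        var candidates = FindCandidates(pdfText);
        return candidates.Count == 1 ? candidates[0] : null;
    }
}
```

When `TryGetSingleCandidate` returns null, `FindCandidates` gives you the full list to inspect, which is where the exclusion rules described above would be applied.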
