Read a PDF and find a specific column to add to a list

Question

Is there a way to programmatically read just the numbers in one column of a PDF file? In other words, can I take a PDF file as input and extract every value in that column?

The column is of the following format:

401232111555713

See this http://stackoverflow.com/a/7515625/231316. @Jared's post is a great place to start but just remember that PDFs don't store tables, only things that happen to look like tables. — Chris Haas, Jul 11 '13 at 19:03
@ChrisHaas I realize that, but since I just want a single column Jared's answer worked perfectly! thanks — , Jul 11 '13 at 19:04

score 4 · Accepted Answer · edited Jul 11 '13 at 19:02

The following code reads the text of every page of a PDF into a string using iTextSharp (it only extracts text the PDF actually contains, so scanned page images yield nothing):

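using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Reads the text of every page into one string; returns "" if the file does not exist.
public static string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // A fresh strategy per page keeps text from accumulating across pages.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

                // GetTextFromPage already returns a .NET string. Round-tripping it through
                // Encoding.Default is unnecessary and can replace characters outside that
                // code page with '?', so no re-encoding is done here.
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                // Keep a line break between pages so the last value on one page
                // cannot run into the first value on the next.
                text.AppendLine(currentText);
            }
        }
        finally
        {
            pdfReader.Close();
        }
    }
    return text.ToString();
}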

From there you can run a regex to pull out the column values, using the 15-digit pattern you described:

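// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
string text = ReadPdfFile(@"path\to\pdf\file.pdf");
// Matches any 15 consecutive digits (including the first 15 of a longer run)
// and captures them in the named group "number".
Regex regex = new Regex(@"(?<number>\d{15})");
List<string> results = new List<string>();
foreach (Match m in regex.Matches(text))
{
    results.Add(m.Groups["number"].Value);
}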
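
If the PDF also contains longer digit runs (for example 16-digit numbers), `\d{15}` will match part of them. A minimal sketch of a stricter variant follows, assuming each column value is exactly 15 digits and is not directly adjacent to other digits. The class and method names (`ColumnParser`, `ExtractColumn`) are hypothetical and not part of the answer above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ColumnParser
{
    // Exactly 15 digits that are neither preceded nor followed by another digit.
    private static readonly Regex FifteenDigits = new Regex(@"(?<!\d)\d{15}(?!\d)");

    // Returns every standalone 15-digit value found in the text, in order of appearance.
    public static List<string> ExtractColumn(string text)
    {
        var results = new List<string>();
        foreach (Match m in FifteenDigits.Matches(text))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

It could be called as `ColumnParser.ExtractColumn(ReadPdfFile(@"path\to\pdf\file.pdf"))`. Whether the extra strictness is wanted depends on what else appears in the document.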

This uses the `SimpleTextExtractionStrategy`. Depending on your use case, you may need a different text extraction strategy, e.g. the `LocationTextExtractionStrategy`. — mkl, Jul 12 '13 at 07:20
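
As a rough illustration of the comment above, here is a sketch of the same page loop with `LocationTextExtractionStrategy` swapped in. It assumes iTextSharp 5.x; the class and method names (`PdfColumnText`, `ReadPdfFileByLocation`) are hypothetical. This strategy orders text chunks by their position on the page rather than by their order in the content stream, which may help when column values are not stored in reading order.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfColumnText
{
    // Hypothetical helper: like ReadPdfFile, but with a position-aware strategy.
    // Returns an empty string if the file does not exist.
    public static string ReadPdfFileByLocation(string fileName)
    {
        if (!File.Exists(fileName))
        {
            return string.Empty;
        }

        var text = new StringBuilder();
        var pdfReader = new PdfReader(fileName);
        try
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // New strategy per page; chunks are sorted by page position.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            pdfReader.Close();
        }
        return text.ToString();
    }
}
```

The returned string can be passed to the same regex as before. Which strategy gives cleaner output depends on how the particular PDF was produced.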

score 0 · Answer 2 · edited May 23 '17 at 10:25

You'll need to use some PDF processing library. Here's a SO link that has discussion on that topic:

Reading PDF in C#

answered Jul 11 '13 at 18:57

Curtis Rutland
