3

Is there a way to programmatically read just the numbers in one column of a PDF file? In other words, can I hand a PDF file to a program and have it extract every value in that column?

Each value in the column is a 15-digit number in the following format:

401232111555713

Maarten Bodewes
  • See this http://stackoverflow.com/a/7515625/231316. @Jared's post is a great place to start but just remember that PDFs don't store tables, only things that happen to look like tables. – Chris Haas Jul 11 '13 at 19:03
  • @ChrisHaas I realize that, but since I just want a single column Jared's answer worked perfectly! thanks –  Jul 11 '13 at 19:04

2 Answers

4

The following code uses iTextSharp to open a PDF and read its text into a string. It only works for PDFs that contain extractable text; a scanned image would need OCR first:

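using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text does not carry over between pages.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

                // GetTextFromPage already returns a .NET string, so no re-encoding is needed;
                // round-tripping through Encoding.Default can only corrupt non-ASCII characters.
                // AppendLine keeps digits at a page boundary from running together.
                text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            // Release the file even if extraction throws.
            pdfReader.Close();
        }
    }
    return text.ToString();
}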

From there you can run a regular expression over the extracted text to pull out the column, using the 15-digit pattern you described:

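// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
string text = ReadPdfFile(@"path\to\pdf\file.pdf");
// Each value in the column is a run of 15 digits; capture it in the named group "number".
Regex regex = new Regex(@"(?<number>\d{15})");
List<string> results = new List<string>();
foreach (Match m in regex.Matches(text))
{
    results.Add(m.Groups["number"].Value);
}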
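
One caveat with a bare `\d{15}` pattern is that it also matches the first 15 digits of any longer run of digits elsewhere in the PDF. A minimal sketch of a stricter variant follows, assuming the column values are always exactly 15 digits and are separated from neighbouring digits by at least one non-digit character. The `ColumnExtractor` class and `ExtractColumn` method are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class ColumnExtractor
{
    // Exactly 15 digits that are neither preceded nor followed by another digit.
    private static readonly Regex FifteenDigits = new Regex(@"(?<!\d)\d{15}(?!\d)");

    // Returns every standalone 15-digit value found in the text, in document order.
    public static List<string> ExtractColumn(string text)
    {
        var results = new List<string>();
        foreach (Match m in FifteenDigits.Matches(text))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

With this pattern, a call such as `ColumnExtractor.ExtractColumn("401232111555713 1234567890123456")` keeps the 15-digit value and skips the 16-digit run, which the unbounded pattern would have partially matched.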
Jared
  • This uses the `SimpleTextExtractionStrategy`; depending on the use case in question, you may need a different text extraction strategy, e.g. the `LocationTextExtractionStrategy`. – mkl Jul 12 '13 at 07:20
0

You'll need to use a PDF processing library. This Stack Overflow question discusses the options:

Reading PDF in C#

Curtis Rutland