iText C# Read pdf for regular expression match, extract only those pages to new pdf

Question

I'm trying to search an existing PDF for regular expression matches and then extract only the matching pages to a new PDF, and I've run into problems with the approach as a whole.

I've decided to clear my head and start again from scratch. I'm able to take a 3-page PDF and copy its pages one by one into a new file using this code:

static void Main(string[] args)
{
    string srcFile = @"C:\Users\steve\Desktop\original.pdf";
    string dstFile = @"C:\Users\steve\Desktop\result.pdf";
    PdfReader reader = new PdfReader(srcFile);
    Document document = new Document();
    PdfCopy copy = new PdfCopy(document, new FileStream(dstFile, FileMode.Create));
    document.Open();
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        PdfImportedPage importedPage = copy.GetImportedPage(reader, page);
        copy.AddPage(importedPage);
    }
    document.Close();
}

This code works because the PdfCopy instance is created OUTSIDE the for loop. The issue I'm running into is that the only way I can seem to get the rest of the code working (converting each page to text and finding regex matches) is to place that functionality, including the creation of the PdfCopy instance, inside the for loop.

Here's the code from my initial question: C# iTextSharp - Code overwriting instead of appending pages

You know that you can use another instance of `PdfReader` to select the pages to copy? - Paulo Soares, Feb 22 '15 at 08:49
I do, however it's specifically the PdfCopy that's giving me the issue. Maybe I'm not understanding you completely? I'm going to go through my code and post something that gets a regex match, so I'm not just asking questions without posting near-complete code. - Steve H, Feb 22 '15 at 15:56
You have to select the pages with regex or whatever other way before entering the loop. Inside the loop only those pages will be added. I can't see why the PdfCopy instance would have to be created inside any loop. - Paulo Soares, Feb 22 '15 at 16:27

score 0 · Answer 1 · answered Feb 23 '15 at 12:38

As @Paulo already proposed in a comment:

You have to select the pages with regex or whatever other way before entering the loop. Inside the loop only those pages will be added.

In code this could look like this:

string srcFile = @"C:\Users\steve\Desktop\original.pdf";
string dstFile = @"C:\Users\steve\Desktop\result.pdf";

PdfReader reader = new PdfReader(srcFile);
ICollection<int> pagesToKeep = new List<int>();

for (int page = 1; page <= reader.NumberOfPages; page++)
{
    // Use the text extraction strategy of your choice here...
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

    // Use the content text test of your choice here (e.g. Regex.IsMatch).
    // IndexOf returns 0 for a match at the very start, so compare with >= 0.
    if (currentText.IndexOf("special") >= 0)
    {
        pagesToKeep.Add(page);
    }
}

// Copy selected pages using PdfCopy
Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileStream(dstFile, FileMode.Create));
document.Open();
foreach (int page in pagesToKeep)
{
    PdfImportedPage importedPage = copy.GetImportedPage(reader, page);
    copy.AddPage(importedPage);
}
document.Close();
reader.Close();

The code can be further streamlined by using a PdfStamper instead of PdfCopy. Simply replace everything from the // Copy selected pages using PdfCopy comment onwards with:

// Copy selected pages using PdfStamper
reader.SelectPages(pagesToKeep);
PdfStamper stamper = new PdfStamper(reader, new FileStream(dstFile, FileMode.Create, FileAccess.Write));
stamper.Close();

The latter variant keeps not only the pages in question but also document-level material, e.g. global JavaScript, document-level file attachments, etc. Whether or not you want that depends on your use case.
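
Since the question asks about regular expression matches rather than a fixed substring, the text test in the first loop could be swapped for `Regex.IsMatch`. The following is only a minimal sketch of that idea: `PageSelector` and `SelectMatchingPages` are hypothetical names (not part of iTextSharp), and it assumes the page texts have already been extracted, for example with `PdfTextExtractor.GetTextFromPage` as above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PageSelector
{
    // Hypothetical helper (not part of iTextSharp): returns the 1-based
    // numbers of all pages whose extracted text matches the pattern.
    public static List<int> SelectMatchingPages(IList<string> pageTexts, string pattern)
    {
        // Build the Regex once so the pattern is not re-parsed for every page.
        Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
        List<int> pagesToKeep = new List<int>();

        for (int i = 0; i < pageTexts.Count; i++)
        {
            // A match anywhere in the page text counts, including at index 0.
            if (regex.IsMatch(pageTexts[i]))
            {
                pagesToKeep.Add(i + 1); // PDF page numbers start at 1
            }
        }
        return pagesToKeep;
    }
}
```

Inside the extraction loop itself, the equivalent change would be to call `regex.IsMatch(currentText)` in place of the `IndexOf` test. The resulting page list can then be handed to the PdfCopy loop or to `reader.SelectPages` unchanged. The pattern is whatever your documents require; something like `@"Invoice\s+\d{5}"` is purely an illustrative assumption.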

score 0 · Answer 2 · edited May 23 '17 at 11:55


Thank you for your response mkl. I answered my other post but forgot about this one. I was able to use the test case provided by Chris in my other (similar) post.

C# iTextSharp - Code overwriting instead of appending pages

With some minor tweaks I was able to get the solution below to work for my project.

answered Feb 24 '15 at 14:57 by Steve H (edited May 23 '17 at 11:55 by Community)
