I'm new to programming, and many posts here have helped me get to where I am. My intention is to get the files within the directory "sourceDir" and look for a Regex match in each one. When the code finds a match, I want to create a new file with the match as its name. If the code finds another file with the same match (so the output file already exists), it should add a new page to that document instead.

Right now the code works; however, instead of adding a new page, it overwrites the first page of the document. NOTE: Every document in the directory is only one page!

string sourceDir = @"C:\Users\bob\Desktop\results\";
string destDir = @"C:\Users\bob\Desktop\results\final\";
string[] files = Directory.GetFiles(sourceDir);
foreach (string file in files)
    {
       using (var pdfReader = new PdfReader(file.ToString()))
            {
                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    var text = new StringBuilder();

                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    var currentText = 
                    PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                    text.Append(currentText);

                    Regex reg = new Regex(@"ABCDEFG");
                    MatchCollection matches = reg.Matches(currentText);

                    foreach (Match m in matches)
                    {
                        string newFile = destDir + m.ToString() + ".pdf";

                        if (!File.Exists(newFile))
                        {
                            using (PdfReader reader = new PdfReader(File.ReadAllBytes(file)))
                            {
                                using (Document doc = new Document(reader.GetPageSizeWithRotation(page)))
                                {
                                    using (PdfCopy copy = new PdfCopy(doc, new FileStream(newFile, FileMode.Create)))
                                    {
                                        var importedPage = copy.GetImportedPage(reader, page);
                                        doc.Open();
                                        copy.AddPage(importedPage);
                                        doc.Close();
                                    }
                                }
                            }
                        }
                        else
                        {
                            using (PdfReader reader = new PdfReader(File.ReadAllBytes(newFile)))
                            {
                                using (Document doc = new Document(reader.GetPageSizeWithRotation(page)))
                                {
                                    using (PdfCopy copy = new PdfCopy(doc, new FileStream(newFile, FileMode.OpenOrCreate)))
                                    {
                                        var importedPage = copy.GetImportedPage(reader, page);
                                        doc.Open();
                                        copy.AddPage(importedPage);
                                        doc.Close();
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
Steve H
  • You seem to constantly overwrite your file. You should create the `PdfCopy` instance in the outer loop. Actually, I don't understand your code. It doesn't seem to match with what you want. Can you document your code (e.g. by adding comments to it that describe what you want to do)? – Bruno Lowagie Jan 12 '15 at 16:30
  • Do you mean PdfCopy, PdfReader, Document? Do I even need the PdfReader at this point? I'm trying to add the file (that has a regex match) as a second, third etc. page in the final document. – Steve H Jan 12 '15 at 16:48
  • My objective is to add a page to the final document instead of overwriting it – Steve H Jan 12 '15 at 16:51
  • OK, but if I understand your code correctly, you are currently creating single-page PDFs using `PdfCopy`, throwing old versions away every time you encounter a new page that needs to be added. That doesn't make sense, does it? Move `Document` and `PdfCopy` out of the inner loop. – Bruno Lowagie Jan 12 '15 at 16:57
  • If I'm seeing this correctly, my guess is I need to do something different with the copy.AddPage(importedPage); line in the else statement. – Steve H Jan 12 '15 at 16:57
  • I think you're not seeing it correctly. Why don't you follow my advice? – Bruno Lowagie Jan 12 '15 at 16:59

2 Answers

Bruno did a great job explaining the problem and how to fix it, but since you've said that you are new to programming, and you've since posted a very similar and related question, I'm going to go a little deeper to hopefully help you.

First, let's write down the knowns:

  1. There's a directory full of PDFs
  2. Each PDF has only a single page

Then the objectives:

  1. Extract the text of each PDF
  2. Compare the extracted text with a pattern
  3. If there's a match, then using the match for a file name do one of:
    1. If a file with that name exists, append the source PDF to it
    2. If no such file exists, create a new file with the PDF

There are a couple of things that you need to know before proceeding. You tried to work in "append mode" by using FileMode.OpenOrCreate. It was a good guess but incorrect. (Strictly speaking, OpenOrCreate starts writing at the beginning of the existing file, which is why you see the first page overwritten; FileMode.Append is the mode that writes past the end. Neither one helps here.) The PDF format has both a beginning and an end, so "start here" and "end here". When you attempt to append another PDF (or anything, for that matter) to an existing file, you are just writing past the "end here" section. At best, that's junk data that gets ignored, but more likely you'll end up with a corrupt PDF. The same is true of almost any file format. Two XML files concatenated are invalid because an XML document can only have one root element.
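If you want to see the XML version of this for yourself, here is a minimal sketch that needs nothing but the .NET base library. The class and method names (ConcatDemo, IsWellFormedXml) are made up for this illustration, and it only demonstrates the XML case, not PDF parsing itself.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class ConcatDemo
{
    // Hypothetical helper: true when the text parses as one well-formed XML document.
    public static bool IsWellFormedXml(string text)
    {
        try
        {
            XDocument.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            // Thrown, among other reasons, when there is more than one root element
            return false;
        }
    }

    public static void Main()
    {
        string first = "<doc><page>1</page></doc>";
        string second = "<doc><page>2</page></doc>";

        Console.WriteLine(IsWellFormedXml(first));          // True
        Console.WriteLine(IsWellFormedXml(first + second)); // False, two root elements
    }
}
```

Each piece is valid on its own, but gluing them together produces something that is no longer a document. Merging has to happen at the "content" level, not the "bytes" level.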

Second but related, iText/iTextSharp cannot edit existing files. This is very important. It can, however, create brand new files that happen to have the exact or possibly modified versions of other files. I don't know if I can stress how important this is.

Third, you are using a line that gets copied over and over again but is very wrong and can actually corrupt your data. For why it is bad, see the link in the code comments further down.

currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
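As a rough illustration of the damage, here is a small sketch with the same shape as that line. The big assumption: Encoding.ASCII stands in for Encoding.Default, because what Encoding.Default actually is depends on your machine and runtime. The RoundTrip helper and its class are hypothetical names for this demo only.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Same three steps as the line above, with the "default" code page passed in
    // so the result does not depend on the machine running it.
    public static string RoundTrip(string text, Encoding codePage)
    {
        // Lossy step: characters the code page cannot represent become '?'
        byte[] narrowed = codePage.GetBytes(text);

        // Re-encoding the already damaged bytes as UTF-8 cannot bring them back
        byte[] utf8 = Encoding.Convert(codePage, Encoding.UTF8, narrowed);
        return Encoding.UTF8.GetString(utf8);
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip("Invoice 42", Encoding.ASCII)); // Invoice 42
        Console.WriteLine(RoundTrip("Café €5", Encoding.ASCII));    // Caf? ?5
    }
}
```

In other words, the line does nothing at all when every character fits in the code page, and silently destroys characters when they don't. The string you get back from the text extractor is already a normal .NET string, so there is nothing to "convert".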

Fourth, you are using RegEx, which is an overly complicated way to perform a simple search. Maybe the code that you posted was just a sample, but if it wasn't, I would recommend just using currentText.Contains("") or, if you need to ignore case, currentText.IndexOf( "", StringComparison.InvariantCultureIgnoreCase ). To give you the benefit of the doubt, the code below assumes you have a more complex RegEx.
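For a fixed search term, the two plain-string options look roughly like this. The sample text and the ContainsIgnoreCase helper are invented for the illustration; only Contains and IndexOf come from the framework.

```csharp
using System;

public static class PlainSearchDemo
{
    // Hypothetical helper wrapping the case-insensitive IndexOf check
    public static bool ContainsIgnoreCase(string text, string term)
    {
        // IndexOf returns -1 when the term is not found
        return text.IndexOf(term, StringComparison.InvariantCultureIgnoreCase) >= 0;
    }

    public static void Main()
    {
        // Stand-in for text extracted from a PDF page
        string currentText = "This is page 3\nabcdefg";

        Console.WriteLine(currentText.Contains("ABCDEFG"));            // False, Contains is case-sensitive
        Console.WriteLine(ContainsIgnoreCase(currentText, "ABCDEFG")); // True
    }
}
```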

With all that, below is a full working example that should walk you through everything. Since we don't have access to your PDFs, the second section actually creates 100 sample PDFs with our search terms occasionally added to them. Your real code obviously wouldn't do this but we need common ground to work with you on. The third section is the search and merge feature that you are trying to do. Hopefully the comments in the code explain everything.

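//Namespaces used by the code below
using System;
using System.IO;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

/**
 * Step 1 - Variable Setup
 */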

//This is the folder that we'll be basing all other directory paths on
var workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);

//This folder will hold our PDFs with text that we're searching for
var folderPathContainingPdfsToSearch = Path.Combine(workingFolder, "Pdfs");

//This folder will hold the combined PDFs, one per matched search term
var folderPathContainingPdfsCombined = Path.Combine(workingFolder, "Pdfs Combined");

//Create our directories if they don't already exist
System.IO.Directory.CreateDirectory(folderPathContainingPdfsToSearch);
System.IO.Directory.CreateDirectory(folderPathContainingPdfsCombined);

var searchText1 = "ABC";
var searchText2 = "DEF";

/**
 * Step 2 - Create sample PDFs
 */

//Create 100 sample PDFs
for (var i = 0; i < 100; i++) {
    using (var fs = new FileStream(Path.Combine(folderPathContainingPdfsToSearch, i.ToString() + ".pdf"), FileMode.Create, FileAccess.Write, FileShare.None)) {
        using (var doc = new Document()) {
            using (var writer = PdfWriter.GetInstance(doc, fs)) {
                doc.Open();

                //Add a title so we know what page we're on when we combine
                doc.Add(new Paragraph(String.Format("This is page {0}", i)));

                //Add various strings every once in a while.
                //(Yes, I know this isn't evenly distributed but I haven't
                // had enough coffee yet.)
                if (i % 10 == 3) {
                    doc.Add(new Paragraph(searchText1));
                } else if (i % 10 == 6) {
                    doc.Add(new Paragraph(searchText2));
                } else if (i % 10 == 9) {
                    doc.Add(new Paragraph(searchText1 + searchText2));
                } else {
                    doc.Add(new Paragraph("Blah blah blah"));
                }

                doc.Close();
            }
        }
    }
}

/**
 * Step 3 - Search and merge
 */


//We'll search for two different strings just to add some spice
var reg = new Regex("(" + searchText1 + "|" + searchText2 + ")");

//Loop through each file in the directory
foreach (var filePath in Directory.EnumerateFiles(folderPathContainingPdfsToSearch, "*.pdf")) {
    using (var pdfReader = new PdfReader(filePath)) {
        for (var page = 1; page <= pdfReader.NumberOfPages; page++) {

            //Get the text from the page
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new SimpleTextExtractionStrategy());

            //DO NOT DO THIS EVER!! See this for why https://stackoverflow.com/a/10191879/231316
            //currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));

            //Match our pattern against the extracted text
            var matches = reg.Matches(currentText);

            //Bail early if we can
            if (matches.Count == 0) {
                continue;
            }

            //Loop through each match
            foreach (var m in matches) {

                //This is the file path that we want to target
                var destFile = Path.Combine(folderPathContainingPdfsCombined, m.ToString() + ".pdf");

                //If the file doesn't already exist then just copy the file and move on
                if (!File.Exists(destFile)) {
                    System.IO.File.Copy(filePath, destFile);
                    continue;
                }

                //The file exists so we're going to "append" the page
                //However, writing to the end of file in Append mode doesn't work,
                //that would be like trying to "add a file to a zip" by concatenating
                //two files. In this case, we're actually creating a brand new file
                //that "happens" to contain the original file and the matched file.
                //Instead of writing to disk for this new file we're going to keep it
                //in memory, delete the original file and write our new file
                //back onto the old file
                using (var ms = new MemoryStream()) {

                    //Use a wrapper helper provided by iText
                    var cc = new PdfConcatenate(ms);

                    //Open for writing
                    cc.Open();

                    //Import the existing file
                    using (var subReader = new PdfReader(destFile)) {
                        cc.AddPages(subReader);
                    }

                    //Import the matched file
                    //The OP stated a guarantee of only 1 page so we don't
                    //have to mess around with specifying which page to import.
                    //Also, PdfConcatenate closes the supplied PdfReader, so
                    //open a fresh reader here instead of reusing pdfReader.
                    using (var subReader = new PdfReader(filePath)) {
                        cc.AddPages(subReader);
                    }

                    //Close for writing
                    cc.Close();

                    //Erase our existing file
                    File.Delete(destFile);

                    //Write our new file
                    File.WriteAllBytes(destFile, ms.ToArray());
                }
            }
        }
    }
}
Chris Haas
  • Thank you SO much for all of the information. I do agree Bruno was absolutely right, I just lacked an understanding of how pdfs are created. Your explanation of PDF vs XML made perfect sense to me. I was able to get this to work in the test case you provided as well as in my actual project. – Steve H Feb 22 '15 at 23:19

I'll write this in pseudo code.

You do something like this:

// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        // create single-page PDF
        new Document();
        new PdfCopy();
        document.Open();
        copy.addPage(singlePage);
        document.Close();
    }
}

This means that you are creating a single-page PDF every time the condition is met. Incidentally, you're overwriting existing files many times.

What you should do is something like this:

// Create a document with as many pages as times a condition is met
new Document();
new PdfCopy();
document.Open();
// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        copy.addPage(singlePage);
    }
}
document.Close();

Now you are possibly adding more than one page to the new document you are creating with PdfCopy. Be careful: an exception can be thrown if the condition is never met.
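Since the question has several possible output files rather than one, one way to apply the same idea is to decide first which source files belong to which output, and only then create one Document/PdfCopy per output. The sketch below covers just that grouping step with plain .NET, no iText calls. GroupFilesByMatch, the file names and the sample texts are all hypothetical, and the already-extracted text is assumed to be available per file.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupByMatchDemo
{
    // Maps each distinct match to the source files whose text contains it.
    public static SortedDictionary<string, List<string>> GroupFilesByMatch(
        IEnumerable<KeyValuePair<string, string>> extractedTextByFile, Regex pattern)
    {
        var groups = new SortedDictionary<string, List<string>>(StringComparer.Ordinal);
        foreach (var entry in extractedTextByFile)
        {
            foreach (Match m in pattern.Matches(entry.Value))
            {
                List<string> files;
                if (!groups.TryGetValue(m.Value, out files))
                {
                    files = new List<string>();
                    groups[m.Value] = files;
                }
                // Add each source file at most once per output
                if (!files.Contains(entry.Key))
                {
                    files.Add(entry.Key);
                }
            }
        }
        return groups;
    }

    public static void Main()
    {
        // Key = source file name, Value = text assumed to be extracted from it
        var texts = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("3.pdf", "This is page 3 ABC"),
            new KeyValuePair<string, string>("6.pdf", "This is page 6 DEF"),
            new KeyValuePair<string, string>("9.pdf", "This is page 9 ABCDEF"),
        };

        var groups = GroupFilesByMatch(texts, new Regex("(ABC|DEF)"));
        foreach (var group in groups)
        {
            // For each group you would create ONE Document/PdfCopy, add every page, then close
            Console.WriteLine(group.Key + ".pdf <- " + string.Join(", ", group.Value));
        }
        // ABC.pdf <- 3.pdf, 9.pdf
        // DEF.pdf <- 6.pdf, 9.pdf
    }
}
```

With the groups known up front, every output file is written exactly once, so nothing gets overwritten and no group is ever empty.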

Bruno Lowagie
  • I've tried your suggestion (or what I believe you to be suggesting) and I'm getting the same results. I think I'm lacking an understanding of iText as a whole. I don't want to ask you to break it completely down for me, I know you have a book for that. Thanks for your help! – Steve H Jan 13 '15 at 19:11