How would I go about parsing a PDF in C#?

Question

I would like to parse a PDF and add a few tags to its text so I can upload it to a few forums. I was thinking of using C# (or possibly Python) to read the PDF and insert tags where needed. Where do I start with this? So far I can convert the PDF into text, but from there I'm stumped. Here's what I have so far:

    /*
     * Convert PDF to text
     */

    using System;
    using System.Windows.Forms;
    using org.pdfbox.pdmodel;
    using org.pdfbox.util;

    namespace Test.iPdfToText
    {
        public partial class MainForm : Form
        {
            public MainForm()
            {
                InitializeComponent();
            }

            // Extracts the text of the PDF and shows it in the rich text box.
            // getText returns a plain string, so no font styling comes with it.
            void Button1Click(object sender, EventArgs e)
            {
                PDDocument doc = PDDocument.load("C:\\pdftoText\\myPdfTest.pdf");
                try
                {
                    PDFTextStripper stripper = new PDFTextStripper();
                    richTextBox1.Text = stripper.getText(doc);
                }
                finally
                {
                    // Release the document even if text extraction fails.
                    doc.close();
                }
            }
        }
    }

You might want to have a look at this: http://stackoverflow.com/questions/1781208/is-there-any-api-in-c-sharp-or-net-to-edit-pdf-documents – prthrokz, Jan 06 '13 at 16:27
I looked at that already. It seems that most people want to change the PDF and keep a PDF as the end result. I want to end up with a text file. So essentially, I could convert my PDF to plain text and then change that. The part I'm confused about is how to insert into that text (a richTextBox in my code). – DannyD, Jan 06 '13 at 16:43
What problem are you trying to solve? Does the code you supplied get the text into the text box? – Jim Mischel, Jan 06 '13 at 16:48
Yes, it does place the text into a text box. However, I then want to search that text box and find a section of text that is, say, bolded. Once I find it, I need to add tags around that text. How do I step through this word by word? Thanks – DannyD, Jan 06 '13 at 17:02
So do you want to search the text box for bold text? Or do you want to know which parts of the extracted text were bold in the PDF? Or more generally, what kind of information are you trying to extract from the PDF? – mkl, Jan 06 '13 at 23:13
I want to find which text is bold in the text box and insert bold tags around that text. That way, when I upload it to my forum, the bold stuff will still be there. I think I know how to search the text box for a specific word or group of words. But how does one search for a style and then insert tags around it? Thanks – DannyD, Jan 07 '13 at 00:57
So essentially you don't care about the style of the text in the PDF, only about the style of the text in the text box (which is applied by some user, I assume). In that case text extraction can be attempted with a PDF library like iText (Sharp). – mkl, Jan 07 '13 at 11:25
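
For the "search for a style and insert tags" part discussed in the comments, one possible approach is to record a bold/not-bold flag per character and then wrap each run of bold characters in tags. The following is only a minimal sketch under assumptions: `BoldTagger`, `WrapBoldRuns`, and the `[b]`/`[/b]` tag strings are hypothetical names chosen for illustration and are not part of PDFBox or WinForms. It also assumes the bold formatting actually exists in the text box; text assigned through `richTextBox1.Text` from `PDFTextStripper` is a plain string and carries no bold information.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of PDFBox or WinForms.
public static class BoldTagger
{
    // Wraps every maximal run of bold characters in open/close tags.
    // isBold[i] states whether text[i] is bold.
    public static string WrapBoldRuns(string text, bool[] isBold,
                                      string open = "[b]", string close = "[/b]")
    {
        if (text == null) throw new ArgumentNullException("text");
        if (isBold == null) throw new ArgumentNullException("isBold");
        if (text.Length != isBold.Length)
            throw new ArgumentException("text and isBold must have the same length");

        StringBuilder sb = new StringBuilder(text.Length + 16);
        bool inBold = false;

        for (int i = 0; i < text.Length; i++)
        {
            if (isBold[i] && !inBold)
            {
                sb.Append(open);   // a bold run starts at this character
                inBold = true;
            }
            else if (!isBold[i] && inBold)
            {
                sb.Append(close);  // the bold run ended before this character
                inBold = false;
            }
            sb.Append(text[i]);
        }

        if (inBold) sb.Append(close); // close a run that reaches the end of the text
        return sb.ToString();
    }
}

// Inside the form, the flags could be collected one character at a time
// (assumption: the indices of richTextBox1.Text line up with Select positions):
//
//   bool[] flags = new bool[richTextBox1.TextLength];
//   for (int i = 0; i < flags.Length; i++)
//   {
//       richTextBox1.Select(i, 1);
//       flags[i] = richTextBox1.SelectionFont != null
//                  && richTextBox1.SelectionFont.Bold;
//   }
//   string tagged = BoldTagger.WrapBoldRuns(richTextBox1.Text, flags);
```

For example, for the text "Hello world" with only "world" flagged as bold, `WrapBoldRuns` returns `Hello [b]world[/b]`. Selecting one character at a time is slow on long documents and moves the caret, so this is a starting point, not a tuned solution. Keeping the tagging logic separate from the control also means it can be reused if the bold flags later come from a PDF library instead of the text box.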
