
I am using the iText PDF library to convert a PDF to text.

Below is my Java code that converts a PDF to a text file.

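import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class PdfConverter {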

/** The original PDF that will be parsed. */
public static final String pdfFileName = "jdbc_tutorial.pdf";
/** The resulting text file. */
public static final String RESULT = "preface.txt";

/**
 * Parses a PDF to a plain text file.
 * @param pdf the original PDF
 * @param txt the resulting text
 * @throws IOException
 */
public void parsePdf(String pdf, String txt) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));

    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        out.println(strategy.getResultantText());
        System.out.println(strategy.getResultantText());
    }
    out.flush();
    out.close();
    reader.close();
}

/**
 * Main method.
 * @param    args    no arguments needed
 * @throws IOException
 */
public static void main(String[] args) throws IOException {
    new PdfConverter().parsePdf(pdfFileName, RESULT);
}
}

The above code works for extracting text from a PDF. However, I need to ignore the header and footer and extract only the body content of the PDF file.

Kurt Pfeifle
amar
  • In this case I think you should `extend SimpleTextExtractionStrategy` to not read/return the header and footer (I guess you know this by position). – Gábor Bakos Jan 12 '15 at 15:37
  • @amar You should give at least a hint as to how header and footer material in your PDF can be recognized. Is it an appropriately tagged PDF? Or is there a page area (with fixed coordinates for all pages) inside of which the content you want lies and outside of which the content you don't want lies? Or how else can it be recognized? – mkl Jan 13 '15 at 09:15
  • @chsdk I tried that one, but it is a different case. I did not get the output. – amar Jan 13 '15 at 09:28
  • @mkl If the PDF contains headers and footers, then it is a tagged PDF. – amar Jan 13 '15 at 09:36
  • Can you share a sample PDF? – mkl Jan 13 '15 at 09:51
  • This is the link http://www.bluebeam.com/us/bluebeam-university/pdf-tutorials/revu-10/headers-and-footers.pdf – amar Jan 13 '15 at 09:58
  • http://www.oracle.com/us/technologies/linux/oracle-linux-ds-1985973.pdf – amar Jan 13 '15 at 10:05
  • What do you want us to do with those links? The BlueBeam document is about headers/footers added with a specific tool that probably **marks** headers and footers as such. Do you want to remove headers and footers that are **marked** that way? If so, share such a PDF. The other PDF is a PDF from Oracle.com. It is not clear what you want to do with that file. – Bruno Lowagie Jan 13 '15 at 10:54
  • I want to remove the headers and footers present in those PDF files. – amar Jan 13 '15 at 11:09
  • Please rephrase the question so that it can actually be answered. The Bluebeam document **does not have any headers or footers!** The Oracle document is a Tagged PDF and headers and footers are defined as Artifacts. Do you want to remove those Artifacts? Repeating the question "I want to remove the headers and footers present in that pdf files" does not make sense. You only annoy people with it. – Bruno Lowagie Jan 13 '15 at 12:58
  • Yes, I want to remove the Artifacts from the Oracle document. Please give a solution. – amar Jan 13 '15 at 14:47

3 Answers


If your PDF has real headers and footers, they should be marked as artifacts (if not, they are just text or content placed at the position of a header or footer). If they are marked as artifacts, you can handle them using ParseTaggedPdf. You can also make use of ExtractPageContentArea if ParseTaggedPdf doesn't work. Look for a few examples related to both.

The above solution is general and depends on the file. If you really need an alternative, you can use Apache libraries such as PDFBox or Tika, or others like PDFTextStream. Note that this alternative won't help if you have to stay with iText and can't move to another library. In PDFBox you can use PDFTextStripperByArea or PDFTextStripper. Look at the JavaDoc or some examples if you need to know how to use them.
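
The area-based approach comes down to keeping only the text whose position lies between the footer and the header. The sketch below models that idea in plain Java without any PDF library. The `Chunk` and `BodyFilter` names and the coordinate values are hypothetical; a real extractor (iText or PDFBox) would supply the text positions. It assumes PDF-style coordinates in which y grows upward from the bottom of the page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a positioned piece of text; not an iText or PDFBox type.
class Chunk {
    final String text;
    final float y; // baseline position in points, measured from the page bottom

    Chunk(String text, float y) {
        this.text = text;
        this.y = y;
    }
}

class BodyFilter {
    // Keeps the text of chunks whose baseline lies inside [footerTop, headerBottom].
    static List<String> bodyText(List<Chunk> chunks, float footerTop, float headerBottom) {
        List<String> kept = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.y >= footerTop && c.y <= headerBottom) {
                kept.add(c.text);
            }
        }
        return kept;
    }
}
```

Calling `BodyFilter.bodyText(chunks, 50f, 780f)` on a page would drop anything positioned in the bottom 50 points or above y = 780. The two thresholds have to be measured from the actual document, which is why this only works when the header and footer sit at fixed positions on every page.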

Tejus Prasad
  • If it is not a tagged PDF, how can I remove the header and footer (if it is just text or content placed at the position of a header or footer)? – amar Jan 17 '15 at 15:24
  • You can check this link: http://massapi.com/class/pd/PDFTextStripperByArea.html You need to define the region that you want to extract. – Tejus Prasad Jan 19 '15 at 15:59

Using iText, I found an example on this site: http://what-when-how.com/itext-5/parsing-pdfs-part-2-itext-5/

In it, you create a rectangle that defines the bounds of the text you want to extract.

PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
// Create the rectangle (llx, lly, urx, ury) that bounds the text to extract
Rectangle rect = new Rectangle(70, 80, 420, 500);
// Create a filter based on the rectangle
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Wrap a fresh extraction strategy in the filter for each page
    strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), filter);
    out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();

As the web page describes, this should work even if the PDF is not tagged.
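
The rectangle coordinates do not have to be hard-coded; they can be derived from the page size and the heights of the header and footer bands. The following is a minimal sketch in plain Java, assuming the page's lower-left corner is at (0, 0) and that you have measured the band heights yourself. The `BodyBox` helper and the sample values are hypothetical and not part of iText.

```java
// Hypothetical helper; the name and the margin values are illustrative only.
class BodyBox {
    // Returns {llx, lly, urx, ury} for the page area between footer and header,
    // assuming the page's lower-left corner is at (0, 0). All values are in points.
    static float[] of(float pageWidth, float pageHeight,
                      float headerHeight, float footerHeight) {
        if (headerHeight < 0 || footerHeight < 0
                || headerHeight + footerHeight >= pageHeight) {
            throw new IllegalArgumentException(
                "header and footer must fit inside the page");
        }
        return new float[] {0f, footerHeight, pageWidth, pageHeight - headerHeight};
    }
}
```

For an A4 page (595 x 842 points) with a 60-point header and a 50-point footer, `BodyBox.of(595f, 842f, 60f, 50f)` yields (0, 50, 595, 782). Those four values could be passed to `new Rectangle(llx, lly, urx, ury)` in the snippet above, with the page dimensions read per page from `reader.getPageSize(i)`.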


You can read specific locations of a PDF file. Just mark the areas that you need to get text from and leave out the areas where the header and footer are shown. I have done this, and the complete code is in my question "iText reading specific location from pdf file runs in IntelliJ and gives desired output but executable jar throws error".

Emil Jan