While extracting text from a PDF file to a .txt file using the iText and PDFBox jars, I am unable to extract some of the special characters. Below is my code:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.pdfbox.pdmodel.PDDocument;
// PDFBox 1.x package; in PDFBox 2.x the class lives in org.apache.pdfbox.text
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFConversionUsingPDFBox {
    public static void main(String[] args) {
        File file = new File("C:/Users/sample/Desktop/test.PDF");
        PDDocument pdDoc = null;
        Writer writer = null;
        try {
            // Load and parse the PDF (PDFBox 1.x API)
            pdDoc = PDDocument.load(file);
            PDFTextStripper pdfStripper = new PDFTextStripper();
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);

            // Write the extracted text (not the PDF itself) with an explicit UTF-8 encoding
            writer = new OutputStreamWriter(
                    new FileOutputStream("C:/Users/sample/Desktop/testfile.txt"), "UTF-8");
            writer.write(parsedText);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release resources whether or not extraction succeeded
            try {
                if (writer != null)
                    writer.close();
                if (pdDoc != null)
                    pdDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
    }
}
sweta
  • The problem lies almost certainly in your PDF. Not *all* text can *always* be extracted under *all* circumstances from *all* PDFs. (Rather the reverse--sometimes, some text may be extractable.) – Jongware Nov 09 '14 at 12:49
  • You do System.out.println on a Windows system by the looks of it. Does your terminal font support those special chars? That being said, please provide a PDF to reproduce your issues. – mkl Nov 09 '14 at 14:42
  • I wanted to mark this as duplicate of http://stackoverflow.com/questions/26670919/itextsharp-diacritic-chars but I can't because that question has no accepted or upvoted answer although it is the correct answer (don't you just hate it when people don't accept an answer?) – Bruno Lowagie Nov 09 '14 at 17:09
  • What happens with Adobe Reader? Can it show the characters? See also https://pdfbox.apache.org/userguide/faq.html#gibberish – Tilman Hausherr Nov 10 '14 at 09:07
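
Following up on the console and encoding points in the comments: below is a minimal sketch, assuming only the JDK (Java 8 or later), for checking whether the special characters are already missing from the extracted string or are only lost when it is printed or written with the platform default charset. The class name `ExtractionDiagnostics` and its methods are hypothetical helpers, and the sample string stands in for the text returned by `PDFTextStripper.getText`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ExtractionDiagnostics {

    /** Lists every non-ASCII code point in the text as "U+XXXX", in order of appearance. */
    public static List<String> nonAsciiCodePoints(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints()
            .filter(cp -> cp > 0x7F)
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    /** Writes the text as UTF-8, independent of the platform default charset. */
    public static void writeUtf8(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the extracted text (assumption for illustration only)
        String parsedText = "na\u00EFve caf\u00E9";

        // If the expected code points show up here, extraction worked and the
        // loss happens later (console font, console code page, or file encoding).
        System.out.println(nonAsciiCodePoints(parsedText)); // prints [U+00EF, U+00E9]

        Path out = Files.createTempFile("extracted", ".txt");
        writeUtf8(out, parsedText);

        // Read the file back as UTF-8 to confirm nothing was lost on disk
        String roundTrip = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(parsedText)); // prints true
    }
}
```

If the code points are absent from the extracted string itself, the cause is more likely in the PDF (for example, fonts without a usable Unicode mapping), as the comments above suggest, and changing the output encoding will not help.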

0 Answers