0

Can Scanner in Java read PDF files? If so, how?

This is what I have, but it isn't working:

Scanner scan = new Scanner(mypdffile);
String result = "";
// Pair hasNextLine() with nextLine(); hasNext() checks for tokens, not lines
while (scan.hasNextLine()) {
    result += scan.nextLine();
}

2 Answers

1

No, Scanner will not work as you intend with PDF files. See this question for suggestions on how to read PDFs in Java. The TL;DR is that you probably want to use a library.

kingkupps
  • Thanks. Is there any way other than PDFBox? For example, would BufferedReader work? – Candyfloss Sep 13 '19 at 18:48
  • @Candyfloss BufferedReader surely won't help you. Readers are for character-based formats, but PDF is binary. You need InputStreams, not Readers. Beyond that, we do not know the internals of the PDF file format well enough to help you parse it by hand. – Yoshikage Kira Sep 13 '19 at 18:54
  • @kingkupps Thanks. I didn't want to use PDFBox or similar libraries, so I ended up using streams. – Candyfloss Sep 16 '19 at 18:00
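
To illustrate the point made in the comments about Readers and binary data, here is a minimal sketch using only the standard library. It decodes a few bytes as UTF-8 text and encodes them back, which is roughly what a Reader or Scanner does on the way in. The class name, method name, and sample byte values are made up for illustration; they are not taken from any real PDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryRoundTrip {
    // Decode bytes as UTF-8 text, then encode the text back to bytes.
    // Byte sequences that are not valid UTF-8 are replaced during decoding,
    // so the result can differ from the input.
    static byte[] throughString(byte[] original) {
        String asText = new String(original, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "%PDF" in ASCII followed by three bytes that are not valid UTF-8
        // (hypothetical sample data).
        byte[] binary = {0x25, 0x50, 0x44, 0x46, (byte) 0xFF, (byte) 0xFE, (byte) 0x80};
        byte[] roundTripped = throughString(binary);
        System.out.println("same bytes: " + Arrays.equals(binary, roundTripped));
        System.out.println("lengths: " + binary.length + " vs " + roundTripped.length);
    }
}
```

The ASCII header survives the round trip, but the non-text bytes do not, which is why a binary format such as PDF should be handled with InputStreams and byte arrays rather than Strings.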
1

I ended up using streams to read from the PDF files, as I was looking for an approach that does not use PDFBox or a similar library.

dos is my DataOutputStream.

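    // Needs java.io.FileInputStream and java.io.IOException
    // try-with-resources closes the input stream even if an exception is thrown
    try (FileInputStream fin = new FileInputStream(mypdffile)) {
        int read;
        byte[] buf = new byte[1024];

        // Copy the raw bytes to dos in chunks; read() returns -1 at end of file
        while ((read = fin.read(buf)) != -1) {
            dos.write(buf, 0, read);
        }
        dos.flush();
        dos.close();
    } catch (IOException ex) {
        ex.printStackTrace();
    }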
  • This will not work. The line `result+=new String(buf);` will harm your PDF, you will end up with different bytes than you had before, due to encoding. PDF is a binary file format. – Tilman Hausherr Sep 17 '19 at 09:57
  • @Tilman Hausherr, how would you suggest converting the bytes to a String? – Candyfloss Sep 17 '19 at 16:35
  • If you want the text (mark, copy and paste in Adobe), then use PDFBox text extraction. You won't be able to do this without a library. If you just want to copy the (binary) bytes, then write to a ByteArrayOutputStream and then call toByteArray(). – Tilman Hausherr Sep 18 '19 at 04:09
  • Yeah that is much better. – Tilman Hausherr Oct 01 '19 at 03:15
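
The byte-copy approach suggested in the comments (write to a ByteArrayOutputStream, then call toByteArray()) can be sketched as follows. This is only a sketch using the standard library: the class name RawPdfBytes and the helper readAllBytes are hypothetical, and it copies the raw bytes without extracting any text from the PDF.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawPdfBytes {
    // Reads the whole file into memory as raw bytes, with no character decoding.
    static byte[] readAllBytes(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[1024];
            int read;
            // read() returns -1 once the end of the file is reached
            while ((read = in.read(buf)) != -1) {
                out.write(buf, 0, read);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // The file path is taken from the first command-line argument.
        byte[] bytes = readAllBytes(Paths.get(args[0]));
        System.out.println(bytes.length + " bytes read");
    }
}
```

If nothing needs to happen between reads, the standard Files.readAllBytes(path) returns the same byte array in one call. Either way, the result is the file's binary content; getting readable text out of it still requires a PDF library.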