I have written a program to parse a PDF into text. I'm getting the output in the console, but I'm not able to write it to a file. This is the code I have written:

public class PDFTextParser {

public static void main(String args[]) throws IOException {
    PDFTextStripper pdfStripper = null;
    COSDocument cosDoc = null;
    try {


         File file = new File("1.pdf");
         PDDocument pdDoc = PDDocument.load(file);
         pdfStripper = new PDFTextStripper();
         String parsedText = pdfStripper.getText(pdDoc);
         System.out.println(parsedText);
         FileWriter out = new FileWriter("output.txt"); 
         BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
         String line = in.readLine();
         while (line!= null) {

                 out.append(line);
                 out.append("\n");
               }
        out.close();
    }catch (IOException e) {
         e.printStackTrace();}
   }
}

The output is:

Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (6:0) at offset 1013093 does not end with 'endobj' but with '7'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (7:0) at offset 1013211 does not end with 'endobj' but with '483'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (9:0) at offset 1020280 does not end with 'endobj' but with '10'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (10:0) at offset 1020396 does not end with 'endobj' but with '15'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (15:0) at offset 1020519 does not end with 'endobj' but with '16'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (16:0) at offset 1020640 does not end with 'endobj' but with '17'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (17:0) at offset 1020756 does not end with 'endobj' but with '18'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (18:0) at offset 1020874 does not end with 'endobj' but with '19'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (19:0) at offset 1020993 does not end with 'endobj' but with '24'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (24:0) at offset 1021111 does not end with 'endobj' but with '25'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (25:0) at offset 1021228 does not end with 'endobj' but with '26'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (26:0) at offset 1021350 does not end with 'endobj' but with '27'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (27:0) at offset 1021469 does not end with 'endobj' but with '28'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (28:0) at offset 1021589 does not end with 'endobj' but with '489'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (458:0) at offset 1026684 does not end with 'endobj' but with '463'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (463:0) at offset 1026809 does not end with 'endobj' but with '464'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (464:0) at offset 1026932 does not end with 'endobj' but with '465'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (465:0) at offset 1027050 does not end with 'endobj' but with '466'
Apr 07, 2016 2:04:10 PM org.apache.pdfbox.pdfparser.COSParser parseFileObject
WARNING: Object (466:0) at offset 1027170 does not end with 'endobj' but with '495'

The parsed PDF text appears in the console after these warnings, but the output file is empty.

Adam Michalik
  • What do you have as an output to this code? – user1314742 Apr 07 '16 at 08:45
  • From what I see in your code, you just need to write `parsedText` to the file with `out.append(parsedText);` and then close `out`. But why are you using `...new InputStreamReader(System.in)`? Are you trying to get input from the user? – Yazan Apr 07 '16 at 09:07
  • Your program just copies `System.in` to the `output.txt` file. So to see something there, you need to provide some input to the program. – Henry Apr 07 '16 at 09:08
  • @Henry yup, also if the user types anything and hits `Enter`, the app will go into an infinite loop, adding the same `line` to the `FileWriter` :) – Yazan Apr 07 '16 at 09:15
  • @Ria In addition to the existing answers: the WARNINGs mean that the PDF does not conform to the PDF spec (I could tell more if you upload the PDF). It could be that this is a binary file that was transferred as ASCII, or it could be that the producer of the PDF made a mistake. Sadly this is not uncommon. If the PDF was produced in your company, tell them. – Tilman Hausherr Apr 07 '16 at 09:52
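
To illustrate the loop problem raised in the comments above: the question's code calls `readLine()` once before the loop, so `line` never changes and the loop cannot end once a line has been read. A minimal sketch of a copy loop that reads a fresh line on every pass is shown here. The class name `CopyLines` and the method `copyLines` are hypothetical, and the demo feeds a `StringReader` instead of `System.in` so that it does not wait for keyboard input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class CopyLines {

    // Copies every line from source to sink. The read happens inside the
    // loop condition, so the loop ends when the input is exhausted.
    public static void copyLines(Reader source, Writer sink) throws IOException {
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            sink.write(line);
            sink.write("\n");
        }
        sink.flush();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in input; with System.in you would pass new InputStreamReader(System.in).
        StringWriter out = new StringWriter();
        copyLines(new StringReader("first\nsecond"), out);
        System.out.print(out);
    }
}
```

This only fixes the loop itself. For the question as asked, no input loop is needed at all, because the text is already available in `parsedText`.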

2 Answers

You have already got the text from the PDF, so just write it to the file. The rest of the code tries to read input from the user (e.g. the keyboard), which you don't need. Just use the code below:

String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
FileWriter out = new FileWriter("output.txt"); 
out.append(parsedText);
out.close();

// No need for this code; it reads input from the user (keyboard).
/*
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
String line = in.readLine();
while (line != null) {
    out.append(line);
    out.append("\n");
}
out.close();
*/
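
As a self-contained variant of the same idea, here is a minimal sketch that writes a string to a file with `java.nio.file.Files`, which opens and closes the file for you. The class `WriteParsedText` and the method `writeText` are hypothetical names, and the placeholder string stands in for the result of `pdfStripper.getText(pdDoc)` so that the example runs without PDFBox on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WriteParsedText {

    // Writes the whole string to the target file as UTF-8,
    // creating the file or replacing its previous content.
    public static void writeText(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder: in the real program this would be pdfStripper.getText(pdDoc).
        String parsedText = "text extracted from the PDF\n";
        writeText(Paths.get("output.txt"), parsedText);
    }
}
```

Choosing UTF-8 explicitly is a design choice of this sketch; `FileWriter` as used above falls back to the platform default encoding on older Java versions, which may or may not be what you want for text extracted from a PDF.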
Yazan

Have you looked at the post "system-out-to-a-file-in-java"?

I like the first answer there, which redirects standard output from the command line:

java -jar myjar.jar > output.txt

In your case it would be something like:

java -cp <classpath> PDFTextParser > output.txt
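
If you would rather do the redirection inside the program, a rough sketch using `System.setOut` follows. The class `RedirectStdout` and the method `runWithStdoutTo` are hypothetical names, and the printed string is a placeholder for the parsed text. Note that either kind of redirection only captures standard output; log messages written to standard error, which is where the default `java.util.logging` console handler writes, would still show up in the console.

```java
import java.io.IOException;
import java.io.PrintStream;

public class RedirectStdout {

    // Runs the task with System.out pointed at the given file,
    // then restores the original console stream.
    public static void runWithStdoutTo(String fileName, Runnable task) throws IOException {
        PrintStream console = System.out;
        try (PrintStream fileOut = new PrintStream(fileName, "UTF-8")) {
            System.setOut(fileOut);
            task.run();
        } finally {
            System.setOut(console);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder output; the real program would print parsedText here.
        runWithStdoutTo("output.txt", () -> System.out.println("text extracted from the PDF"));
        System.out.println("Back on the console.");
    }
}
```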

Hope it helps

Laiv