Java - Text Extraction from PDF using OCR

Question

I have a PDF file (part of it is shown below) and want to extract text from it. I have used PDFTextStream, but it doesn't work with this file. (However, it worked with another file that contains simple text.)

What other OCR libraries are capable of doing it?

Please Help. Thank you.

Does your pdf just contain a scanned paper copy of the original document? You can't expect 100% exact results from OCR, especially in complicated documents like this. It's a big problem that the text and lines are overlapping in many places. It makes it very hard to impossible for an algorithm to distinguish individual glyphs. — Håken Lid, Apr 16 '16 at 08:44
@HåkenLid The text and lines are not overlapping; it only looks that way because I zoomed in. — Dax Amin, Apr 16 '16 at 08:57
@HåkenLid Is this document too complicated for OCR? However, I don't need all the text. I just need to extract the Name, Address (from the top section) & Past Dues/Refunds Table. — Dax Amin, Apr 16 '16 at 09:00
OCR is used on _scanned_ documents. If the file is not generated from a paper original, OCR is not relevant at all. PDF is a file format that can contain widely different kinds of content. It's meant for print and viewing on screen. There's no general method for extracting data from PDF files. — Håken Lid, Apr 16 '16 at 09:43
It might be quite possible to extract data from this specific document. But it's not possible to tell just from seeing an image preview. — Håken Lid, Apr 16 '16 at 09:46
I tried with PDFBox and it gave satisfactory results. Thank You! — Dax Amin, Apr 16 '16 at 11:13

score 5 · Answer 1 · answered Apr 16 '16 at 11:16

I tried with PDFBox and it produced satisfactory results.

Here is the code to extract text from PDF using PDFBox:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTest {

    public static void main(String[] args) {
        File input = new File("C:/BillOCR/data/bill.pdf"); // The PDF file to extract text from
        File output = new File("D:/SampleText.txt");       // The text file that receives the extracted text

        // try-with-resources closes the document and flushes/closes the writer, even if extraction fails.
        // PDDocument.load(File) is the PDFBox 2.x API.
        try (PDDocument pd = PDDocument.load(input);
             BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)))) {
            System.out.println(pd.getNumberOfPages());
            System.out.println(pd.isEncrypted());
            pd.save("CopyOfBill.pdf"); // Creates a copy called "CopyOfBill.pdf"
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(1); // Start extracting from page 1
            stripper.setEndPage(1);   // Stop after page 1
            stripper.writeText(pd, wr);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
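
Once the text has been extracted, the fields mentioned in the question (Name, Address) can be picked out with regular expressions. Below is a minimal sketch using only the JDK. It assumes the extracted text puts each field on its own line in the form "Label: value"; the sample text, the class name BillFieldParser, and the method extractField are all hypothetical, and the real bill's layout may well differ.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BillFieldParser {

    // Returns the value that follows "label:" on the same line, if such a line exists.
    // Assumption: one field per line, written as "Label: value".
    static Optional<String> extractField(String text, String label) {
        Pattern p = Pattern.compile(
                "(?m)^[ \\t]*" + Pattern.quote(label) + "[ \\t]*:[ \\t]*(.+?)[ \\t]*$");
        Matcher m = p.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the text written by PDFTextStripper.
        String text = "Name: Jane Doe\nAddress: 12 Example Street\n";

        System.out.println(extractField(text, "Name").orElse("(not found)"));
        System.out.println(extractField(text, "Address").orElse("(not found)"));
    }
}
```

A multi-line address or the rows of the Past Dues/Refunds table would need their own patterns, written against the text that PDFBox actually produces for the bill.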

This will work if you have a well-formed PDF. It will not give any result if someone takes a picture and saves it as a PDF; for that you need OCR. — Nitin, Jul 17 '20 at 10:20