
I am extracting text from a PDF in my Android application using iText (itextpdf). It works fine for PDFs in English, but when I try to extract text from a Marathi or Hindi PDF, the extracted text is wrong.

It gives this result:

मत्रबध अरुण कळकणी ैंु शेणाने जधमनी सारवनू झाल्या आधण समुाकका गणुगणुत रागोळी काढू लागली. ती ं

Please help me extract the proper content.

Manoj
  • What exactly is the "proper content" for those who don't know the differences? – OneCricketeer Dec 02 '16 at 07:10
  • The kana and matras are usually given separate codes entirely, and they are then written along with the letters. You might want to check the ordering once; if there is a pattern, you'll have to rectify them to the last value. – Sanved Dec 02 '16 at 07:19
  • I just want to say that it is not giving me correct words as in the pdf. – Manoj Dec 02 '16 at 07:20
  • Please share a sample file. From your description it is entirely unclear what the "proper content" is compared to what you retrieve. So far one can only guess. Is your issue probably a duplicate of what is analyzed in [this q&a](http://stackoverflow.com/a/30804279/1729265)? In that case the PDF is simply lying to text extractors about its content. – mkl Dec 02 '16 at 07:37
  • https://drive.google.com/open?id=0B4oyXMsVV5i5UFlkRDNOY0hFOVU – Manoj Dec 02 '16 at 07:43
  • This is link for sample file @mkl – Manoj Dec 02 '16 at 07:43
  • @Manoj thanx... but Benoit was faster answering. ;) – mkl Dec 02 '16 at 11:39

1 Answer

If you weren't on Android, the answer would be easy: use iText 7. The output comes out much cleaner when parsing the document with iText 7. It is still not 100% correct, but at least it looks mostly readable to me (although I'd need a native speaker to confirm). This is for page 2:

मैत्रबधं अरुण कुळकणी
मैत्रबधं

अरुण कुळकणी

ई साहित्य प्रहिष्ठान
ई साहित्य प्रहिष्ठान

The results are similar for the next page, with some minor hiccups but nothing as distorted as in iText 5.

But yeah, unfortunately you're on Android. There is as of yet no Android version for iText 7, so you'd be stuck waiting for one or trying to manually port iText to the Android platform (which will probably take forever if you're not intimately familiar with both Android and iText).

This is the iText 7 code I used:

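// Java standard library imports
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Test framework import (assuming JUnit 4)
import org.junit.Test;

// iText imports
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;

public class HindiText {

    @Test
    public void go() throws Exception {
        // Open the PDF and the output file; both are closed automatically
        try (PdfDocument doc = new PdfDocument(new PdfReader("input.pdf"));
             OutputStream os = new FileOutputStream("output.txt")) {
            // Extract the text of page 3 and write it out as UTF-16
            String result = PdfTextExtractor.getTextFromPage(doc.getPage(3));
            os.write(result.getBytes(StandardCharsets.UTF_16));
        }
    }
}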

FYI: as of 2016-12-02 you need to build iText 7 from source (https://github.com/itext/itext7) to achieve the quality I described above. This functionality will be contained in iText 7.0.2 when it is released.
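
If no native speaker is at hand, a rough way to compare two extraction results is to count words that begin with a combining mark (a matra or similar sign with no base letter before it), as the stray word of bare vowel signs in the question's output does. The following is only a heuristic sketch using the Java standard library; the class and method names are hypothetical, and a count of zero does not prove the text is correct.

```java
class MatraCheck {

    // Hypothetical helper: counts whitespace-separated words whose first
    // code point is a combining mark (Unicode category Mn or Mc). In
    // well-formed Devanagari text, such a mark should follow a base letter.
    static int countOrphanedMarks(String text) {
        int count = 0;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            int type = Character.getType(word.codePointAt(0));
            if (type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "\u0905\u0930\u0941\u0923" is a normal word; "\u0948\u0902\u0941"
        // consists only of vowel signs, so it is counted as suspicious.
        String sample = "\u0905\u0930\u0941\u0923 \u0948\u0902\u0941";
        System.out.println(countOrphanedMarks(sample));
    }
}
```

Running the check on the output of both iText versions for the same page gives a quick, if crude, indication of which one is less distorted.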

blagae
  • Which language did you use for this? Can you share your code so I can get some idea? – Manoj Dec 02 '16 at 09:42
  • Thanks for sharing your code, I will let you know if it works positively for me. – Manoj Dec 02 '16 at 09:49
  • I tried your code, but it gives me the same output as before. Can you tell me which jar file you used from iText 7? I used the "com.itextpdf:kernel:7.0.1" dependency. – Manoj Dec 02 '16 at 10:29
  • @Manoj you are correct, with iText 7.0.1 it still shows the erroneous files. I usually run my code with the bleeding edge code (7.0.2-SNAPSHOT), and it works a lot better. I will add a remark to that effect to the answer. – blagae Dec 02 '16 at 10:59
  • I am trying to extract text from this PDF, https://drive.google.com/open?id=0B4oyXMsVV5i5RkZhcUI1SmtPMHc, but it does not give the expected output. Can you please help me? Why is that? – Manoj Dec 05 '16 at 07:35
  • I won't get in the habit of handling these kinds of cases in comments, but this input file was created incorrectly. It has a WinAnsi encoding, which means that, for purposes of text extraction, any letter shapes from the font you're using are mapped to Windows CP 1252 (Latin). The Unicode information for the characters is lost and cannot be retrieved by iText, Adobe Acrobat, or any other PDF reading tool. You will need an OCR solution to get this file to work correctly. – blagae Dec 05 '16 at 08:08
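
To illustrate the WinAnsi point in the last comment: when a font's character codes are declared as WinAnsi, an extractor can only interpret them as CP 1252, so the result is Latin text no matter which shapes the font draws. The sketch below uses only the Java standard library and does not involve iText; the byte values and names are made up for illustration and are not taken from the linked file.

```java
import java.nio.charset.Charset;

class WinAnsiDemo {

    // Interprets raw character codes the way a WinAnsi-encoded font is
    // read for text extraction: each byte is looked up in Windows CP 1252.
    static String decodeAsWinAnsi(byte[] glyphCodes) {
        return new String(glyphCodes, Charset.forName("windows-1252"));
    }

    // Returns true if any code point in the text is in the Devanagari script.
    static boolean containsDevanagari(String text) {
        return text.codePoints().anyMatch(
                cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.DEVANAGARI);
    }

    public static void main(String[] args) {
        // Hypothetical codes: the font may draw Devanagari shapes for these
        // bytes, but the encoding only says "CP 1252 characters".
        byte[] codes = {0x65, (byte) 0xE9, 0x4D};
        String extracted = decodeAsWinAnsi(codes);
        System.out.println(extracted);
        System.out.println(containsDevanagari(extracted));
    }
}
```

Nothing in the decoded string points back to the original Devanagari characters, which is why only recognizing the rendered shapes (OCR) can recover them.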