Bangla text extraction from PDF using iText5

Question

I'm trying to extract text from a PDF in my Android application. For this purpose I'm using iText5.

Its extraction of English text is satisfactory. I need to extract Bangla text as well. For Bangla it produces the following result:

It should actually produce something like the following:

This is my code for extraction:

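import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

PdfReader pdfReader = new PdfReader(filePath);
String pageText = PdfTextExtractor.getTextFromPage(pdfReader, page); // page numbers are 1-based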

How can I extract Bangla text properly with iText5? Or is there any other library to do this smoothly?

PS: I've also tried iText7 for this, but I got a RuntimeException during the Gradle build, so I dropped it and went back to iText5.

What exactly is your concern? My Bangla is a bit rusty, but it seems to do a bang-up job. You just have some missing glyphs. — Kayaman, Sep 15 '20 at 06:22
@Kayaman I extract Bangla text from a PDF file and put it in another file. It's not extracting as expected. The words are not what they should be. For example, where it should extract "Programming" it extracts something else. — miq0717, Sep 15 '20 at 06:34
This usually means that the **ToUnicode** mappings in the PDF are deficient. Have you tried copy&paste from Adobe Reader? Does that extract as desired? — mkl, Sep 15 '20 at 07:39
@mkl Yes. I just tried copy&paste from Adobe Reader. The characters after pasting were out of sorts. Is there any way to fix this so that I can get the desired result? — miq0717, Sep 15 '20 at 09:04
As you don't share the PDF in question, I cannot say for sure; apparently, though, your PDF does not contain the information required for text extraction. It can probably be added in a semi-manual process, see [this answer](https://stackoverflow.com/a/39644941/1729265), but otherwise going for OCR might be your best option. — mkl, Sep 15 '20 at 09:52
@mkl Here you go. You can find the PDF at this link: [PDF](https://drive.google.com/file/d/1XTerqzrfaK_CcZHB72esDkfVYclL7hYS/view?usp=sharing). After extracting the text I convert it to EPUB. It'd be better if I can stick to iText, as the images are extracted as expected with it. — miq0717, Sep 15 '20 at 10:39
I'm afraid the **ToUnicode** maps of the fonts here indeed claim that what you see as the result of text extraction (iText) / copy&paste (Acrobat) is the text in the PDF. So *before* text extraction you may try to repair the maps as shown in the link above, but as mentioned there this is not trivial. You can also check whether extracted characters and drawn characters relate one-to-one. In that case a simple text replacement after text extraction might do. — mkl, Sep 15 '20 at 15:24
@mkl I was able to migrate to iText7. Is there any way to extract the text with iText7 without the above modifications? — miq0717, Sep 17 '20 at 07:12
OK. (There are small differences in font data retrieval between iText 5 and iText 7, so there might have been a difference here. Apparently there isn't, though.) Thus, you will indeed have to repair the **ToUnicode** maps. Or use OCR. — mkl, Sep 20 '20 at 07:32
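
The "simple text replacement after text extraction" idea from the comments could look roughly like the sketch below. It is plain Java and independent of iText: the class name `ExtractedTextRepair` and the two mapping entries are hypothetical placeholders, not values taken from the PDF in question. A real table would have to be built by comparing the extracted characters with the drawn glyphs for each font, and the approach only helps if the two relate one-to-one. Legacy Bangla fonts may store vowel signs in visual rather than logical order, in which case a per-character replacement alone would not be enough.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractedTextRepair {

    // Hypothetical table: code point reported by text extraction -> intended text.
    private final Map<Integer, String> replacements = new LinkedHashMap<>();

    /** Registers one replacement and returns this object for chaining. */
    public ExtractedTextRepair map(int extractedCodePoint, String intended) {
        replacements.put(extractedCodePoint, intended);
        return this;
    }

    /** Replaces every mapped code point; unmapped ones are copied unchanged. */
    public String repair(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); ) {
            int cp = extracted.codePointAt(i);
            String intended = replacements.get(cp);
            if (intended != null) {
                out.append(intended);
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Placeholder entries for illustration only (assumed, not from the PDF).
        ExtractedTextRepair repair = new ExtractedTextRepair()
                .map('A', "\u0995")   // 'A' -> BENGALI LETTER KA
                .map('B', "\u09BE");  // 'B' -> BENGALI VOWEL SIGN AA
        System.out.println(repair.repair("AB"));
    }
}
```

In the question's code, `repair(pageText)` would be applied to the string returned by `PdfTextExtractor.getTextFromPage`. Since different fonts in one PDF can need different tables, a single global table like this one is an assumption that may not hold for every document.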
