
I'm trying to extract text from a PDF in my Android application. For this purpose I'm using iText5.

Its extraction of English text is satisfactory, but I also need to extract Bangla text. For Bangla it produces the following result:

[Image: text extracted by iText5]

It should actually produce something like the following:

[Image: expected Bangla text]

This is my code for extraction:

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
PdfReader pdfReader = new PdfReader(filePath);
String pageText = PdfTextExtractor.getTextFromPage(pdfReader, page);

How can I extract Bangla text properly with iText5? Or is there another library that handles this well?

PS: I've also tried iText7 for this, but I got a RuntimeException during the Gradle build, so I dropped it and went back to iText5.

miq0717
  • What exactly is your concern? My Bangla is a bit rusty, but it seems to do a bang-up job. You just have some missing glyphs. – Kayaman Sep 15 '20 at 06:22
  • @Kayaman I extract Bangla text from a PDF file and put it in another file. It's not extracting as expected. The words are not what they should be. For example, where it should extract "Programming" it extracts something else. – miq0717 Sep 15 '20 at 06:34
  • This usually means that the **ToUnicode** mappings in the PDF are deficient. Have you tried copy&paste from Adobe Reader? Does that extract as desired? – mkl Sep 15 '20 at 07:39
  • @mkl Yes. I just tried copy&paste from Adobe Reader. The pasted characters were garbled. Is there any way to fix this so that I can get the desired result? – miq0717 Sep 15 '20 at 09:04
  • As you don't share the PDF in question, I cannot say for sure; apparently, though, your PDF does not contain the information required for text extraction. It can probably be added in a semi-manual process, see [this answer](https://stackoverflow.com/a/39644941/1729265), but otherwise going for OCR might be your best option. – mkl Sep 15 '20 at 09:52
  • @mkl Here. You can find the PDF at this link: [PDF](https://drive.google.com/file/d/1XTerqzrfaK_CcZHB72esDkfVYclL7hYS/view?usp=sharing). After extracting the text I convert it to EPUB. It would be better if I could stick with iText, as the images are extracted as expected with it. – miq0717 Sep 15 '20 at 10:39
  • I'm afraid the **ToUnicode** maps of the fonts here indeed claim that what you see as the result of text extraction (iText) / copy&paste (Acrobat) is the text in the PDF. So *before* text extraction you may try to repair them as shown in the link above, but as mentioned there this is not trivial. You can also check whether extracted characters and drawn characters relate one-to-one. In that case simple text replacement after text extraction might do. – mkl Sep 15 '20 at 15:24
  • @mkl I was able to migrate to iText7. Is there any way to extract the text with iText7 without the above modifications? – miq0717 Sep 17 '20 at 07:12
  • I don't know. Have you tried? – mkl Sep 17 '20 at 17:45
  • @mkl yes. no luck – miq0717 Sep 20 '20 at 05:45
  • OK. (There are small differences in font data retrieval between iText 5 and iText 7, so there might have been a difference here. Apparently there isn't, though.) Thus, you will indeed have to repair the **ToUnicode** maps. Or use OCR. – mkl Sep 20 '20 at 07:32
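
The "simple text replacement after text extraction" idea from the comments could be sketched roughly as follows. This is only a minimal illustration, assuming the wrongly extracted characters really do relate one-to-one to the drawn glyphs. The class name `GlyphRemapper`, the method `remap`, and the two table entries are hypothetical; the actual table would have to be built by hand by comparing the extracted text against the rendered page, separately for each font in the PDF.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: post-processes text that was extracted through
// incorrect ToUnicode maps by substituting each code point via a lookup table.
class GlyphRemapper {

    // Replaces every code point that has an entry in the table;
    // code points without an entry are copied through unchanged.
    static String remap(String extracted, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); ) {
            int cp = extracted.codePointAt(i);
            String replacement = table.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed, made-up entries purely for illustration; they are NOT
        // taken from the PDF discussed in the question.
        Map<Integer, String> table = new HashMap<>();
        table.put((int) 'A', "\u0995"); // pretend 'A' is drawn as BENGALI LETTER KA
        table.put((int) 'B', "\u09BE"); // pretend 'B' is drawn as BENGALI VOWEL SIGN AA
        System.out.println(remap("AB C", table));
    }
}
```

In the question's setting, `pageText` from `PdfTextExtractor.getTextFromPage` would be passed to `remap`. If the same extracted character stands for different glyphs in different fonts, a single table like this is not enough, and repairing the **ToUnicode** maps or falling back to OCR remains the option.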
