Extract Bangla text in C# from a PDF that uses an embedded subset TrueType (CID) font with Identity-H encoding

Asked Feb 25 '23 at 17:34

Active Feb 25 '23 at 20:13

Viewed 93 times

I want to extract Bangla text from a PDF file in C# using the iTextSharp NuGet package. In the PDF the text looks like this: মোঃ শুকুকুর আলী , মোঃ জালাল মিয়া, and I want to read it exactly as it appears. But when I read it in C# with iTextSharp, it returns: �মাঃ জা লাল িম য়া, �মাঃ �ক ু র আলী. How can I solve this problem? My PDF file and code are below.

My controller code:

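using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

string text = string.Empty;
using (PdfReader reader = new PdfReader(path))
{
    // Only page 1 is read here; loop up to reader.NumberOfPages to read every page.
    for (int pageNo = 1; pageNo <= 1; pageNo++)
    {
        // A fresh strategy per page keeps text from accumulating across pages.
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        text += PdfTextExtractor.GetTextFromPage(reader, pageNo, strategy);
    }
    // No explicit reader.Close() is needed; the using block disposes the reader.
}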

The extracted text in the text variable comes out broken, as shown above.
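To make "broken" more concrete, the extracted string can be dumped as Unicode code units. This is only a diagnostic sketch, not a fix: `ExtractionDiagnostics`, `DescribeCodeUnits`, and `CountReplacementChars` are hypothetical helper names, not part of iTextSharp. A U+FFFD in the dump would suggest a glyph that the extractor could not map back to Unicode, and a vowel sign such as U+09BF listed before its consonant would suggest the text came out in visual rather than logical order.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper for inspecting extracted text; not an iTextSharp API.
static class ExtractionDiagnostics
{
    // Formats every UTF-16 code unit as "U+XXXX(UnicodeCategory)", space separated.
    public static string DescribeCodeUnits(string text) =>
        string.Join(" ", text.Select(c =>
            $"U+{(int)c:X4}({CharUnicodeInfo.GetUnicodeCategory(c)})"));

    // Counts U+FFFD REPLACEMENT CHARACTER occurrences in the text.
    public static int CountReplacementChars(string text) =>
        text.Count(c => c == '\uFFFD');
}
```

Calling `Console.WriteLine(ExtractionDiagnostics.DescribeCodeUnits(text));` right after the extraction loop prints one entry per code unit, which is easier to compare against the expected Bangla text than the rendered string.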

My PDF file: https://drive.google.com/drive/folders/1L18hGoBaSQl8xCUIXVpWUbOnhWtsSPRi?usp=share_link

PDF font details:

[Image: PDF font details]

edited Feb 25 '23 at 20:13

Dharman


asked Feb 25 '23 at 17:34

Hossain Mohammad

Are you sure that this example document can be freely shared? It contains personal information. – Glenner003 Mar 03 '23 at 10:39


0 Answers