Extract text from Japanese .pdf file

Asked Apr 07 '16 at 07:54

Active Apr 09 '16 at 09:33

Viewed 594 times

I'm working with a PDF file. I need to extract only the Japanese text from it and then save that text to my database as a string.

I've searched Stack Overflow and gone through four pages of Google results, but I cannot find a solution.

I'm trying smalot's pdfparser (github.com/smalot/pdfparser), but it just outputs unreadable characters.

For example:

\w ��w �� /��yyy/� �Fq�J�yyy/�S�M��dyyy/�q� �Cyyy/�>; �Cyyy/��yyy/�]b;tKh�yyy/�� y /��yyyy/� �� Cyyyyy/� � a ��yyyy/�� wyyy/� a �Ugyyy/�� e{yyyy/�2�" Copyright(c)2014 Daiichikizai.,Co.,Ltd All rights reserved.

I'm using the Yii framework with PHP 5.5.

I tried utf8_encode(), utf8_decode(), and mb_convert_encoding(), but nothing works.

UPDATE: I tried mb_detect_encoding() and it returns UTF-8, so maybe this is not an encoding problem.
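A note on that check: a UTF-8 result only says the bytes are well-formed UTF-8, not that they stand for the intended characters. The sketch below is a minimal illustration in Python rather than PHP. The sample string, the Shift_JIS starting point, and the `is_valid_utf8` helper are all assumptions made for the example and are not taken from the PDF in question.

```python
# Hypothetical sample: Japanese text that starts out as Shift_JIS bytes.
original = "日本語"
raw = original.encode("shift_jis")

# Mis-decode the bytes as Latin-1 (every byte maps to some character),
# then re-encode the result as UTF-8, as a careless conversion might do.
garbled = raw.decode("latin-1").encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The garbled bytes are well-formed UTF-8 ...
print(is_valid_utf8(garbled))
# ... yet they no longer spell the original text.
print(garbled.decode("utf-8") == original)
```

So a detector reporting UTF-8 is compatible with the text already having been mangled earlier in the pipeline.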

Any suggestions would be deeply appreciated.

edited Apr 09 '16 at 09:33

Termininja

asked Apr 07 '16 at 07:54

Dang Nguyen

What kind of PDF viewer do you have? And do you have Eastern language support installed for your PDF viewer? – izk Apr 07 '16 at 07:59
I'm using the default Windows PDF viewer. I'm sorry, but what I need is to extract the text data from the PDF file so I can save it to my database as a string, not just view it. – Dang Nguyen Apr 07 '16 at 08:09
OK, try to make your question more precise then. Add the information from that comment to your question. – izk Apr 07 '16 at 08:12
Can you provide the PDF in question? – Jan Slabon Apr 07 '16 at 08:16
@Setasign I updated the sample pdf file. Thank you. – Dang Nguyen Apr 07 '16 at 08:22
The encoding is reported as Identity-H. That means the character indexes (`3e9`, `2FA3`, and so on) are not given in terms of Unicode, but as *specific character indexes* in the original font file. In the file, I cannot find a table to convert from these indexes to Unicode or any other regular encoding. – Jongware Apr 07 '16 at 12:17
Possible duplicate of [Read Japanese characters in a PDF file](http://stackoverflow.com/questions/22431215/read-japanese-characters-in-a-pdf-file) – Jongware Apr 07 '16 at 12:19
@RadLexus is correct. The mentioned script simply doesn't support predefined CMaps. – Jan Slabon Apr 07 '16 at 13:50
@Setasign & RadLexus thank you guys, can you tell me what I can do to retrieve that text, please? I'm a noob. – Dang Nguyen Apr 08 '16 at 06:31
I'm not aware of a GPL-compatible solution in PHP, sorry. – Jan Slabon Apr 08 '16 at 07:03
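
To make the Identity-H comment above concrete, here is a minimal sketch in Python (not PHP, and not what pdfparser does internally). With Identity-H, a shown string is read as 2-byte big-endian codes, and those codes are indexes into the font rather than Unicode values. Turning them into text needs a separate code-to-Unicode table, such as a `/ToUnicode` CMap, which the comment says could not be found in this file. The function names and the one-entry mapping are hypothetical and exist only for illustration.

```python
import struct


def split_identity_h(data):
    """Read a byte string shown with an Identity-H font as 2-byte big-endian codes."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    return [code for (code,) in struct.iter_unpack(">H", data)]


def to_text(codes, to_unicode):
    """Map codes through a code -> text table; unmapped codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# The two indexes quoted in the comment, written as a PDF hex string would hold them.
codes = split_identity_h(bytes.fromhex("03E92FA3"))

# Hypothetical mapping, for illustration only. Real entries would have to come
# from a /ToUnicode CMap or a similar table, which was not found in this file.
hypothetical_map = {0x03E9: "日"}

print([hex(c) for c in codes])
print(to_text(codes, hypothetical_map))  # one code resolved, one unknown
print(to_text(codes, {}))                # no table at all: nothing is recoverable
```

Without such a table every code falls through to the replacement character, which is consistent with the unreadable output shown in the question.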

0 Answers