
I'm working with a PDF file. I need to extract only the Japanese text from it and then save that text in my database as a string.

I've searched Stack Overflow and gone through four pages of Google results, but I cannot find a solution.

I'm trying smalot/pdfparser (github.com/smalot/pdfparser), but it only outputs unreadable characters.

For example:

\w ���w ��� � /�����yyy/� �Fq�J�yyy/�S�M��dyyy/�q� �Cyyy/�>; �Cyyy/��������yyy/�]b;tKh�yyy/��� ����y /����yyyy/� �� �Cyyyyy/� � a ���yyyy/���� wyyy/� a �Ugyyy/����� e{yyyy/�2�" Copyright(c)2014 Daiichikizai.,Co.,Ltd All rights reserved.

I'm using the Yii framework with PHP 5.5.

I tried utf8_encode(), utf8_decode(), and mb_convert_encoding(), but nothing works.

UPDATE: I tried mb_detect_encoding() and it returns UTF-8, so maybe this is not an encoding problem.

Any suggestions would be deeply appreciated.

  • What kind of PDF viewer do you have? And do you have Eastern language support installed for it? – izk Apr 07 '16 at 07:59
  • I'm using the default Windows PDF viewer. I'm sorry, but what I need is to extract the text data from the PDF file so I can save it in my database as a string, not just view it. – Dang Nguyen Apr 07 '16 at 08:09
  • OK, then try to make your question more precise. Add the information from that comment to the question itself. – izk Apr 07 '16 at 08:12
  • Can you provide the PDF in question? – Jan Slabon Apr 07 '16 at 08:16
  • @Setasign I updated the sample pdf file. Thank you. – Dang Nguyen Apr 07 '16 at 08:22
  • The encoding is reported as Identity-H. That means the character indexes (`3e9`, `2FA3`, and so on) are not given in terms of Unicode, but as *specific character indexes* in the original font file. In the file, I cannot find a table to convert from these indexes to Unicode or any other regular encoding. – Jongware Apr 07 '16 at 12:17
  • Possible duplicate of [Read Japanese characters in a PDF file](http://stackoverflow.com/questions/22431215/read-japanese-characters-in-a-pdf-file) – Jongware Apr 07 '16 at 12:19
  • @RadLexus is correct. The mentioned script simply doesn't support predefined CMaps. – Jan Slabon Apr 07 '16 at 13:50
  • @Setasign & RadLexus thank you, guys. Can you please tell me what I can do to retrieve that text? I'm a noob. – Dang Nguyen Apr 08 '16 at 06:31
  • I'm not aware of a GPL-compatible solution in PHP, sorry. – Jan Slabon Apr 08 '16 at 07:03
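
To illustrate the Identity-H point raised in the comments: with that encoding, the content stream holds 2-byte glyph indexes rather than Unicode, so no amount of utf8_decode() or mb_convert_encoding() on the raw bytes can recover the text. What is needed is a lookup table from glyph index to Unicode (a ToUnicode CMap in the PDF, or the mapping tables of the font's character collection). The sketch below shows only that lookup step. The function name `decodeIdentityH` and the two table entries are assumptions made up for illustration and are not taken from the sample PDF or from pdfparser. It also assumes the mbstring extension and handles only code points in the Basic Multilingual Plane.

```php
<?php
// Sketch only: the glyph-index => Unicode pairs below are invented for
// illustration. Real values must come from the font's ToUnicode CMap or
// from the mapping tables of its character collection.
$cidToUnicode = array(
    0x03E9 => 0x65E5, // hypothetical: index 03E9 -> U+65E5
    0x2FA3 => 0x672C, // hypothetical: index 2FA3 -> U+672C
);

/**
 * Decode a hex string of 2-byte glyph indexes (Identity-H) to UTF-8.
 * Indexes missing from the table become U+FFFD so that gaps stay visible.
 */
function decodeIdentityH($hex, array $cidToUnicode)
{
    $utf8 = '';
    // Identity-H uses fixed 2-byte codes, i.e. 4 hex digits per glyph.
    foreach (str_split($hex, 4) as $chunk) {
        $cid = hexdec($chunk);
        $codePoint = isset($cidToUnicode[$cid]) ? $cidToUnicode[$cid] : 0xFFFD;
        // Pack the code point as one UTF-16BE unit, then convert to UTF-8.
        $utf8 .= mb_convert_encoding(pack('n', $codePoint), 'UTF-8', 'UTF-16BE');
    }
    return $utf8;
}

echo decodeIdentityH('03E92FA3', $cidToUnicode), "\n";
```

With a real table in place of the invented one, the resulting UTF-8 string could be stored in the database as usual. If the PDF has no ToUnicode CMap, as reported above, the table has to be obtained from outside the file, which is the part the comments say pdfparser does not support.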
