How to use Imagick annotateImage for Chinese text?

Question

I need to annotate an image with Chinese Text and I am using Imagick library right now.

An example of a Chinese Text is

这是中文

The Chinese Font file used is this

The file was originally named 华文黑体.ttf

It can also be found on Mac OS X under /Library/Fonts

I have renamed it to the English name STHeiTi.ttf to make it easier to reference the file in PHP code.

In particular, I am using the Imagick::annotateImage function.

I also am using the answer from "How can I draw wrapped text using Imagick in PHP?".

I am using it because it works for English text, and the application needs to annotate both English and Chinese, though not at the same time.

The problem is that when I run annotateImage with Chinese text, I get an annotation that looks like 罍

Code included here

Chinese text? What about first creating a graphic of the chinese symbols and then merge it onto the image? — hakre, Jun 19 '12 at 13:04
Well, for each chinese character, create one image that displays it. Then put these images together for example. Might not be the best method, but might spare you the problem to actually use some chinese font. — hakre, Jun 19 '12 at 13:16
Also I think the context of your question it is interesting to know which encoding your Chinese text is using and to know which encodings the Imagick library supports. — hakre, Jun 19 '12 at 13:17
Your point about encoding has led me to this answer. http://stackoverflow.com/a/10546631/80353 I am going to give that one a shot. — Kim Stacks, Jun 20 '12 at 05:33
Can you add a Chinese text string and a Chinese TTF font to your question? — hakre, Jun 20 '12 at 08:39
@SalmanA I cannot use Imagick annotateImage for Chinese text. @hakre I have uploaded the font, etc. — Kim Stacks, Jun 20 '12 at 15:05
@kimsia: can you post some code, specially the portion where you define/read/input the text that you want to render on the image. I think this could be an encoding problem. — Salman A, Jun 20 '12 at 15:35
@kimsia: using `explode` on utf8 encoded data will probably break the encoding. — Salman A, Jun 22 '12 at 08:22
@Salman A: `explode(' ', ...)` is fine in UTF-8 as well as in ISO-8859-n, that is not the problem. — Walter Tross, Jun 26 '12 at 12:48
@WalterTross Sorry I have not got a chance to try it out. Other tricky programming issues to resolve. Will do so in another few weeks time. Your enthusiasm is wonderful. I will respond to your new wordwrapannotation function. — Kim Stacks, Jul 08 '12 at 00:50

score 7 · Answer 1 · edited May 23 '17 at 10:29

The problem is that you are feeding imagemagick the output of a "line splitter" (wordWrapAnnotation), and you are applying utf8_decode to the text before passing it in. That is certainly wrong for Chinese text: utf8_decode can only handle UTF-8 text that CAN be converted to ISO-8859-1 (the most common 8-bit extension of ASCII), and Chinese characters cannot.
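To make the loss visible, here is a minimal sketch (not taken from the question's code) that performs the same UTF-8 to ISO-8859-1 conversion with mb_convert_encoding. It assumes the mbstring extension is loaded and that the source file itself is saved as UTF-8.

```php
<?php
// Assumption: this file is saved as UTF-8 and ext/mbstring is available.
$text = '这是中文';

// Same direction of conversion as utf8_decode: UTF-8 -> ISO-8859-1.
// None of these characters exist in ISO-8859-1, so each one is replaced
// by mbstring's substitute character.
$latin1 = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

// Converting back does not restore the original characters.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($roundTrip === $text); // bool(false)
```

Whatever is drawn after such a conversion can no longer contain the original Chinese characters, regardless of the font.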

Now, I hope that your text is UTF-8 encoded. If it is not, you might be able to convert it like this:

$text = mb_convert_encoding($text, 'UTF-8', 'BIG-5');

or like this

$text = mb_convert_encoding($text, 'UTF-8', 'GB18030'); // only PHP >= 5.4.0

(in your code $text is rather $text1 and $text2).
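If the encoding of the input is not known in advance, one possible approach is to check for valid UTF-8 first and convert only when the check fails. The helper name toUtf8 and the default fallback encoding are assumptions made for this sketch, and the check is a heuristic: a byte sequence in another encoding can occasionally also be valid UTF-8.

```php
<?php
// Hypothetical helper, not part of the question's code.
// Requires ext/mbstring; scalar type hints need PHP 7 or later.
function toUtf8(string $text, string $fallbackEncoding = 'GB18030'): string
{
    // Leave the text alone if it already is well-formed UTF-8.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise assume the fallback encoding and convert from it.
    return mb_convert_encoding($text, 'UTF-8', $fallbackEncoding);
}

// Usage: simulate input that arrives as GB18030 bytes.
$gbBytes = mb_convert_encoding('这是中文', 'GB18030', 'UTF-8');
$utf8    = toUtf8($gbBytes);
```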

Then there are (at least) two things to fix in your code:

pass the text "as is" (without utf8_decode) to wordWrapAnnotation,
change the argument of setTextEncoding from "utf-8" to "UTF-8" as per specs

I hope that all variables in your code are initialized in some missing part of it. With the two changes above (the second one might not be necessary, but you never know...), and with the missing parts in place, I see no reason why your code should not work, unless your TTF file is broken or the Imagick library is broken (imagemagick, on which Imagick is based, is a great library, so I consider this last possibility rather unlikely).
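Put together, the two fixes might look like the sketch below. It requires the Imagick extension and a font file that contains Chinese glyphs. The function name annotateUtf8, the canvas size, the font size and the text position are all assumptions chosen for illustration, not values from the question.

```php
<?php
// Hypothetical helper; name, canvas size and coordinates are assumptions.
function annotateUtf8(string $text, string $fontPath, string $outPath): void
{
    // Blank white canvas to draw on.
    $image = new Imagick();
    $image->newImage(400, 100, new ImagickPixel('white'));

    $draw = new ImagickDraw();
    $draw->setFont($fontPath);        // e.g. the renamed STHeiTi.ttf
    $draw->setFontSize(32);
    $draw->setTextEncoding('UTF-8');  // fix 2: upper-case encoding name

    // fix 1: pass the UTF-8 string as is, without utf8_decode
    $image->annotateImage($draw, 10, 50, 0, $text);

    $image->setImageFormat('png');
    $image->writeImage($outPath);
}

// Example call (paths are placeholders):
// annotateUtf8('这是中文', 'STHeiTi.ttf', 'out.png');
```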

EDIT:

Following your request, I update my answer with

a) the fact that setting mb_internal_encoding('utf-8') is very important for the solution, as you say in your answer, and

b) my proposal for a better line splitter, that works acceptably for western languages and for Chinese, and that is probably a good starting point for other languages using Han logograms (Japanese kanji and Korean hanja):

function wordWrapAnnotation(&$image, &$draw, $text, $maxWidth)
{
   $regex = '/( |(?=\p{Han})(?<!\p{Pi})(?<!\p{Ps})|(?=\p{Pi})|(?=\p{Ps}))/u';
   $cleanText = trim(preg_replace('/[\s\v]+/', ' ', $text));
   $strArr = preg_split($regex, $cleanText, -1, PREG_SPLIT_DELIM_CAPTURE |
                                                PREG_SPLIT_NO_EMPTY);
   $linesArr = array();
   $lineHeight = 0;
   $goodLine = '';
   $spacePending = false;
   foreach ($strArr as $str) {
      if ($str == ' ') {
         $spacePending = true;
      } else {
         if ($spacePending) {
            $spacePending = false;
            $line = $goodLine.' '.$str;
         } else {
            $line = $goodLine.$str;
         }
         $metrics = $image->queryFontMetrics($draw, $line);
         if ($metrics['textWidth'] > $maxWidth) {
            if ($goodLine != '') {
               $linesArr[] = $goodLine;
            }
            $goodLine = $str;
         } else {
            $goodLine = $line;
         }
         if ($metrics['textHeight'] > $lineHeight) {
            $lineHeight = $metrics['textHeight'];
         }
      }
   }
   if ($goodLine != '') {
      $linesArr[] = $goodLine;
   }
   return array($linesArr, $lineHeight);
}

In words: the input is first cleaned up by replacing all runs of whitespace, including newlines, with a single space, except for leading and trailing whitespace, which is removed. Then it is split either at spaces, or right before Han characters not preceded by "leading" characters (like opening parentheses or opening quotes), or right before "leading" characters. Lines are assembled in order not to be rendered in more than $maxWidth pixels horizontally, except when this is not possible by the splitting rules (in which case the final rendering will probably overflow). A modification in order to force splitting in overflow cases is not difficult. Note that, e.g., Chinese punctuation is not classified as Han in Unicode, so that, except for "leading" punctuation, no linebreak can be inserted before it by the algorithm.
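The splitting step can be tried on its own, without Imagick. The sketch below reuses the pattern from wordWrapAnnotation; the wrapper name splitForWrapping and the sample strings are made up for illustration.

```php
<?php
// Same pattern as in wordWrapAnnotation.
$regex = '/( |(?=\p{Han})(?<!\p{Pi})(?<!\p{Ps})|(?=\p{Pi})|(?=\p{Ps}))/u';

// Hypothetical wrapper around the preg_split call used in the function.
function splitForWrapping(string $text, string $regex): array
{
    return preg_split($regex, $text, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

// Mixed text: the space is kept as its own piece and every Han
// character becomes a separate piece.
$mixed = splitForWrapping('Hello 这是中文', $regex);

// Fullwidth parentheses: the opening one stays attached to the character
// that follows it, the closing one to the character that precedes it.
$paren = splitForWrapping('中文（测试）', $regex);

print_r($mixed);
print_r($paren);
```

Because the only characters consumed by the pattern are spaces, and those are captured, joining the pieces gives back the cleaned-up input.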

Hi Walter, your answer helped me to arrive at the final solution. Want to thank you for your help. — Kim Stacks, Jun 27 '12 at 04:28
Let me award the bounty to you as a gesture of thanks. You have indeed helped me a lot with your answer. — Kim Stacks, Jun 28 '12 at 09:33
@kimsia: that's very nice of you! When I find the time I'd like to fix that line splitting in order to make it work definitely for both English and Chinese (and maybe for other space-less languages too...). — Walter Tross, Jun 28 '12 at 10:11
Hi Walter, that would be good. I think setting the internal encoding is a very important part of the answer. I hope you highlight this in your new code. — Kim Stacks, Jun 29 '12 at 12:20

score 3 · Answer 2 · answered Jun 22 '12 at 13:38

I'm afraid you will have to choose a TTF that can support Chinese code points. There are many sources for this, here are two:

http://www.wazu.jp/gallery/Fonts_ChineseTraditional.html

http://wildboar.net/multilingual/asian/chinese/language/fonts/unicode/non-microsoft/non-microsoft.html

Answered by Ja͢ck

@kimsia: or [Unicode](http://en.wikipedia.org/wiki/Unicode#Architecture_and_terminology) in Wikipedia – Walter Tross Jun 26 '12 at 21:29
Jack, why do you suppose that the TTF of kimsia does not support Chinese code points? BTW, your first link is about traditional Chinese, while what kimsia uses is simplified Chinese. – Walter Tross Jun 26 '12 at 21:33
@waltertross hmm op must have updated his question then :-/ will review later – Ja͢ck Jun 27 '12 at 01:14
It is cool. My question is solved, thanks to Walter. Thank you for trying. – Kim Stacks Jun 27 '12 at 04:29

score 3 · Accepted Answer · edited May 23 '17 at 12:18

Full solution here:

https://gist.github.com/2971092/232adc3ebfc4b45f0e6e8bb5934308d9051450a4

Key ideas:

Set the HTML charset and the internal encoding on both the form page and the processing page:

header('Content-Type: text/html; charset=utf-8');
mb_internal_encoding('utf-8');

These lines must be at the top of the PHP files.
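What the internal encoding changes can be seen in a small sketch. It assumes the mbstring extension is loaded and the file is saved as UTF-8; the header() call is left out here because the snippet is meant to be run from the command line.

```php
<?php
// With the internal encoding set, mb_* functions called without an
// explicit encoding argument treat strings as UTF-8.
mb_internal_encoding('UTF-8');

$text = '这是中文';

$bytes    = strlen($text);           // byte count: 3 bytes per character here
$chars    = mb_strlen($text);        // character count
$firstTwo = mb_substr($text, 0, 2);  // cuts on character boundaries

var_dump($bytes, $chars, $firstTwo);
```

Byte-oriented functions such as strlen and substr can cut a multi-byte character in half, which is one way to end up with garbled glyphs in the rendered image.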

Use this function to determine if text is Chinese and use the right font file

function isThisChineseText($text) {
    return preg_match("/\p{Han}+/u", $text);
}

For more details check out https://stackoverflow.com/a/11219301/80353
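One way to use that check for choosing a font is sketched below. The helper pickFontFor is hypothetical, and both font file names are placeholders: STHeiTi.ttf is the renamed font from the question, while Arial.ttf simply stands in for whatever font is used for English text.

```php
<?php
function isThisChineseText($text) {
    // preg_match returns 1 if at least one Han character is found.
    return preg_match("/\p{Han}+/u", $text);
}

// Hypothetical helper: returns the font file to use for a given text.
function pickFontFor(string $text,
                     string $chineseFont = 'STHeiTi.ttf',
                     string $defaultFont = 'Arial.ttf'): string
{
    return isThisChineseText($text) === 1 ? $chineseFont : $defaultFont;
}

echo pickFontFor('这是中文'), "\n";
echo pickFontFor('Hello world'), "\n";
```

Text that mixes both scripts is treated as Chinese by this check, which is usually what is wanted as long as the Chinese font also contains Latin glyphs.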

Set TextEncoding properly in ImagickDraw object

$draw = new ImagickDraw();

// set utf 8 format
$draw->setTextEncoding('UTF-8');

Note the capitalized UTF. This was helpfully pointed out to me by Walter Tross in his answer here: https://stackoverflow.com/a/11207521/80353

Use preg_match_all to explode English words, Chinese Words and spaces

// separate the text by chinese characters or words or spaces
preg_match_all('/([\w]+)|(.)/u', $text, $matches);
$words = $matches[0];

Inspired by this answer https://stackoverflow.com/a/4113903/80353

This works just as well for English text.
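The behaviour of this pattern on ASCII input can be checked with a short sketch; the wrapper name tokenize is made up for illustration. Note that punctuation inside a word becomes its own token, which is the limitation raised in the comments on this answer. How runs of Han characters are grouped depends on whether \w matches them in the PCRE build in use, so the sketch makes no claim about that.

```php
<?php
// Hypothetical wrapper around the preg_match_all call shown above.
function tokenize(string $text): array
{
    // Either a run of word characters, or any single other character.
    preg_match_all('/([\w]+)|(.)/u', $text, $matches);
    return $matches[0];
}

// "UTF-8" is split at the hyphen into three tokens.
$tokens = tokenize('UTF-8 text');
print_r($tokens);
```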

Answered by Kim Stacks, Jun 27 '12 at 04:28

The last regex will split the string "UTF-8" into 3 separate "words". Your fix of wordWrapAnnotation is incorrect, also because it can now return a space or punctuation at the beginning of the second line. `explode(' ', ...)` was correct, unless there is some quirk of Chinese writing I'm not aware of. I also think you could have accepted my solution, since you used the two code fixes that it contains. It is true that you have added info, but that could have happened in comments (and I could have edited my solution too). – Walter Tross Jun 27 '12 at 07:42
OK, now I see what the "quirk" of Chinese writing is: there are, in general, no spaces between words. One way to split into "words", for your purposes, could be something like: `preg_split("/((?<= )|(?=\p{Han})(?=\pL))/u", $str, -1, PREG_SPLIT_NO_EMPTY)`, which "cuts" the string after spaces or before Han "letters" (words, in fact), but the trailing spaces should be handled separately (taken out and added back only if no line split occurs). Note: there is a space after `?<=`. – Walter Tross Jun 27 '12 at 22:27
The regex above should be enhanced in order not to allow certain characters to end a line (these characters are equivalent to western characters that are normally preceded by a space, like opening parentheses or opening quotes - e.g., see [here](http://msdn.microsoft.com/en-us/goglobal/bb688158.aspx)) – Walter Tross Jun 27 '12 at 23:04
@WalterTross I am sorry about not being able to accept your answer. Your answer was the last clue I needed to solve the mystery. Breaking up the text into 2 separate lines turns out to be part of the requirements, so I am okay with doing that. I will try your earlier comment about using preg_split("/((?<= )|(?=\p{Han})(?=\pL))/u", $str, -1, PREG_SPLIT_NO_EMPTY). However, I noticed that this regex takes out spaces. I do want spaces for English text, as the wordWrapAnnotation function is supposed to work for both English and Chinese to keep the code short. Nevertheless, I will give that a shot. – Kim Stacks Jun 28 '12 at 00:28
There was a previous regex I wrote, which I deleted replacing it with the one you mention, exactly because it "ate up" spaces. I haven't tried this last one, but it shouldn't eat up anything, since it is simply an alternative between a lookbehind [assertion](http://it.php.net/manual/en/regexp.reference.assertions.php) and two lookahead assertions, with no matched characters. – Walter Tross Jun 28 '12 at 06:26
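Whether the preg_split pattern from these comments drops spaces can be checked directly. The sketch below uses a made-up sample string and needs only PCRE with Unicode support.

```php
<?php
// Pattern from the comments: split after a space or before a Han letter.
// Both alternatives are zero-width assertions, so no character is consumed.
$pattern = "/((?<= )|(?=\p{Han})(?=\pL))/u";

$str    = 'Hello world 这是中文';
$pieces = preg_split($pattern, $str, -1, PREG_SPLIT_NO_EMPTY);

print_r($pieces);

// Joining the pieces reproduces the input, spaces included.
var_dump(implode('', $pieces) === $str);
```

The spaces end up at the end of the English pieces, which is why the comment above suggests handling trailing spaces separately when a line break falls there.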
