32

With Thai text on a client site, we can't control exactly where words or sentences break across lines; that is up to the web browser. Local reviewers often flag the resulting layout as incorrect.

The workaround is for the copywriter to deliver Thai content with breaking (U+200B) and non-breaking zero-width space characters already inserted.

In practice, rather than:
ของเพื่อนๆ ที่ออนไลน์อยู่
we should use something as ugly as:
ของเพื่อนๆ​ที่​ออนไลน์อยู่

The above is just an example; I don't know exactly where breaks are allowed.
In fact, non-breaking zero-width spaces alone would also do the trick; it's just stricter and more accurate to use breaking ones as well.
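
The markup described above can be generated rather than typed by hand. The following is only a minimal sketch: it assumes the text has already been split into words by someone who knows Thai, it assumes U+2060 WORD JOINER as the non-breaking zero-width character, and the helper names `clusters`, `protect_word` and `mark_breaks` are made up for illustration. Whether a given browser honours these characters is exactly what would need testing.

```python
import unicodedata

ZWSP = "\u200b"         # ZERO WIDTH SPACE: a line break is allowed here
WORD_JOINER = "\u2060"  # WORD JOINER: assumed choice of non-breaking zero-width character


def clusters(word):
    """Split a word into base characters, each with its combining marks attached."""
    out = []
    for ch in word:
        # Thai upper/lower vowels and tone marks are combining marks (category M*);
        # they must stay glued to the preceding base character.
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def protect_word(word):
    """Forbid breaks inside a word by joining its clusters with WORD JOINER."""
    return WORD_JOINER.join(clusters(word))


def mark_breaks(words):
    """Allow breaks only between the given words."""
    return ZWSP.join(protect_word(w) for w in words)
```

Joining on clusters instead of single code points keeps vowel signs and tone marks attached to their consonant, so the invisible characters never land between a base character and its mark.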

While this is certainly doable, it is time-consuming and not very practical for managing content on a large site. Simply put, the effort is out of proportion to the result.

Research so far has led only to the workaround above, so I'm looking for a better way to handle this. Even the W3C doesn't have a solution yet and is still discussing whether it should be part of the CSS3 specification.

Thai uses spaces very rarely, mostly to separate sentences. A typical Thai sentence is therefore one long string. Where such a string may break when it wraps over several lines is determined by identifying the individual words. Word identification relies on dictionaries that are most probably part of the operating system or the web browser; I'm not entirely sure which.

Apparently, the more browsers and operating systems you check, the more different results you get! Moreover, there's not much you can do about it, as it's system-driven and there are no "where to break Thai" settings available.

Using <wbr/>, &#8203; or &shy; to mark the real breakpoints won't stop the browser from deciding (even if wrongly) that breaks are also possible in places you haven't marked, e.g. in the middle of a word, which may be grammatically incorrect.

If such a word lands at the end of a line (which depends on screen resolution, copy length and the CSS rules in effect) and the browser applies its wrong line-breaking rule to it, you end up with a Thai line-breaking issue, no matter what other breakpoints you have defined before, after or elsewhere in the word. The browser will always use the breakpoint it thinks is closest to the end of the line, not just the ones you have gently suggested by inserting one of the characters above into your markup.

That's why you actually need to focus on where not to break your text (non-breaking zero-width space), not on where breaking is allowed. And that leads us back to the long, ugly markup example in the workaround above. That way a line break can only occur where you have allowed it, but it's messy.

Any other suggestion for handling this more effectively would be appreciated ... and who knows, it might even help the W3C with their implementation?

THANK YOU!

Marcin
joooc
  • Did you specify the correct charset in the meta tag of your webpage? Maybe here you'll find an answer: http://www.webmasterworld.com/forum21/7039.htm – Bart1990 Dec 13 '11 at 17:48
  • @Bart Charsets don't have much influence on text wrapping rules, only on the encoding of the text. – deceze Dec 15 '11 at 03:45
  • To note, this is the same problem in other languages that do not use spaces, like Japanese. There are a few rules regarding Japanese where it *shouldn't* break, like right before a "。" or that phonemes like しゃ shouldn't be broken apart, and most browsers adhere to these rules. This still often leads to unnatural word breaks that would be avoided in professional typesetting. Maybe you can generalize your question beyond Thai? – deceze Dec 15 '11 at 03:47
  • The _[result](http://www.joooc.info/thai-line-breaking-issue/utf8-vs-tis620.png)_ of changing `charset=UTF-8"` to `charset=TIS-620"` on a test page was that em dashes, •, € and some other characters were added to the output, and instead of the initial 240 characters we got 616! An unwelcome surprise that doesn't resolve the issue and actually makes it worse. Nowadays TIS-620 seems archaic and, apart from saving a few bytes here and there, it's no longer recommended. – joooc Jan 09 '12 at 14:48

3 Answers

32

I know this thread is quite old, but I have something to say as a native Thai speaker. I read lots of Thai web pages every day, and I feel the quality of Thai line breaking in modern web browsers is perfectly acceptable.

As far as I know, Google Chrome uses ICU4C, Internet Explorer uses the Uniscribe API, and Firefox uses libthai to break Thai sentences into words. The Thai people I know find the way these browsers handle Thai line breaks perfectly acceptable. (We did have this problem with very early versions of Firefox (1.x), but that has been resolved.)

Thai line breaking and word breaking, unlike in Western languages, is still considered an unsolved problem and is actively being tackled by many linguistics researchers. Currently no implementation can perfectly break a sentence into Thai words. The IBM ICU Boundary Analysis page contains some analysis of this problem.

Often it depends on context. For example, the phrase "ตากลม" can correctly be broken into "ตา","กลม" or "ตาก","ลม". Each reading means something completely different, but Thai readers can still understand the intended meaning from the context.
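
To see why a dictionary alone cannot settle this, here is a small sketch that lists every way a string can be split into dictionary words. It is only an illustration: the function name `segmentations` and the four-word `toy_dictionary` are invented for this example, and real implementations such as ICU are far more sophisticated.

```python
def segmentations(text, dictionary):
    """Return every way to split text into words taken from dictionary."""
    if not text:
        return [[]]  # one way to split the empty string: no words at all
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Try every segmentation of the remainder after this word.
            for rest in segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


# Hypothetical toy dictionary containing only the four words from the example.
toy_dictionary = {"ตา", "กลม", "ตาก", "ลม"}
for words in segmentations("ตากลม", toy_dictionary):
    print(" | ".join(words))
```

Both splits come back as equally valid, so picking the right one needs the context that a human reader brings.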

Given that your local reviewers are already used to reading Thai websites, I think they may be pushing you too hard to resolve this. It is a common, unsolved problem for all Thai websites, web browsers, and even Microsoft Word.

It is best to wait (or contribute to IBM ICU) until Thai sentence-breaking implementations get better. Let the web browsers handle this. I don't think working around this problem is worth your valuable time. As far as I know, even Thai website publishers here just don't bother to get this right.

Should you need to publish a document with perfect line/word breaking, you could consider another medium, such as a PDF document, in which you have more control over the line breaks.

Hope this helps :)

Gant
  • This summary couldn't be better! Thank you @m3rLinEz for a great response! My conclusion would be: if you can - contribute to [ICU](http://site.icu-project.org/), if you can't - wait :-) – joooc Feb 04 '12 at 22:27
  • Are there any new solutions or updates to this answer, or is this still a research problem? – Gurpreet Jul 09 '15 at 11:24
  • @Gurpreet I think the situation remains pretty much the same. For example, I looked at ICU's Thai dictionary (`svn log source/data/brkitr/thaidict.txt`) and I haven't seen many additions or improvements. Most people would be happy with the quality of the word/line breaking done by modern web browsers. I'd love to hear about your situation though, as I admit I don't see much need for perfect Thai word breaking in web content. – Gant Jul 09 '15 at 19:57
  • @Gant Hey Gant :) thank you very much for your valuable info - I appreciate it. May I ask an additional question: we are developing a mobile app where we have to dynamically break Thai sentences into lines (so that they fit into a dialog, for example). Do you (and other native speakers) still understand the meaning of a sentence if we break it just anywhere (or multiple times)? Thank you and best regards from Switzerland! – sjkm Jul 26 '16 at 14:56
  • @sjkm Break it just somew here wouldn't cu t it I'm not super familiar with mobile dev, but shouldn't Android and iOS already have the support these days (if you stuff the text inside the text boxes). – Gant Jul 26 '16 at 16:18
  • @Gant Thanks for getting back to me! It's inside a game where we draw everything with our own custom logic, so we cannot use Android's default breaking/handling. Am I wrong to say that it should still be readable if we just break to a new line when space runs out? The reader can simply carry on reading on the next line... wrong logic? Have a nice day! – sjkm Jul 26 '16 at 17:39
  • @sjkm I see. So it's custom drawing inside a game. You're right that it'll still be readable, though it is going to annoy your readers. It would feel like reading, for example, "hello wo\nrld. this is on a sepa\nrate line". Maybe you know best whether that is okay. – Gant Jul 26 '16 at 20:26
  • @Gant Thank you for that comparison :) hard to imagine... but interesting. I guess we have to go this way due to the lack of alternatives, right? – sjkm Jul 27 '16 at 07:42
  • As I noted, ICU is now part of Unicode. Note tickets such as https://ssl.icu-project.org/trac/ticket/11775 – Steven R. Loomis May 10 '17 at 18:47
3

The ICU and ICU4J libraries have a dictionary-based word break iterator for Thai that you could use on the server side to inject breaking zero-width spaces where appropriate.

Or you could use it to build a utility that runs at build time or on delivery of translations, if you know the spacing requirements that far in advance.
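
As a rough sketch of the injection step only, the function below takes word boundary offsets (of the kind a break iterator reports) and inserts U+200B at each one. The name `inject_zwsp` and the way boundaries are passed in are assumptions for illustration; obtaining the offsets from ICU is left out, and the offsets are assumed to index the string character by character, which matches ICU's UTF-16 offsets for Thai because the Thai block lies in the Basic Multilingual Plane.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: marks an allowed line break


def inject_zwsp(text, boundaries):
    """Insert ZWSP at each interior word boundary offset of text."""
    pieces = []
    previous = 0
    for offset in sorted(set(boundaries)):
        if offset <= 0 or offset >= len(text):
            continue  # nothing to mark at the very start or end
        pieces.append(text[previous:offset])
        # Whitespace already allows a break, so no ZWSP is needed next to it.
        if not (text[offset - 1].isspace() or text[offset].isspace()):
            pieces.append(ZWSP)
        previous = offset
    pieces.append(text[previous:])
    return "".join(pieces)
```

For example, `inject_zwsp("ตากลม", [0, 2, 5])` puts a single zero-width space between "ตา" and "กลม". Running this once at build time or on delivery of translations keeps the invisible characters out of the copywriter's hands.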

See ICU Boundary Analysis for more info. These libraries are available for C, C++, and Java.

J.Spiral
  • Thank you @J.Spiral for the ICU suggestion. I've asked the client to have their Thai reviewer look at it first to make sure the dictionaries actually match word breaks correctly. If they do, we might give it a try and develop a script/tool based on ICU. And if that works, I'll let you all know :-) Thx – joooc Jan 10 '12 at 09:29
  • See my other answer at http://stackoverflow.com/questions/8492763/thai-line-breaking-how-to-break-thai-text-effectively#comment74834802_8950895 – Steven R. Loomis May 10 '17 at 18:46
0

There is a W3C working group working on exactly this (for Thai and other Southeast Asian languages). Their layout requirements draft is quite recent, from last month:

I hope this information can feed into the fruitful discussion here.

You can also follow/join the Southeast Asia Language Enablement (sealreq) activity on GitHub: https://github.com/w3c/sealreq

bact'