
I'm trying to use an ICU RuleBasedBreakIterator in C++ to segment Lao text into syllables. ICU has corresponding rules for Thai, which is "same same but different". The Solr folks have something working in Java that I could take the rules from, but I cannot find any example of how to instantiate a RuleBasedBreakIterator directly through the constructor that accepts the rules, as opposed to going through the factory methods in BreakIterator. Here's what I have so far, a slightly modified function from the ICU docs:

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <unicode/unistr.h>
#include <unicode/rbbi.h>
#include <unicode/chariter.h>

using namespace std;
using namespace icu;  // ICU 61+ no longer imports the icu namespace by default

void listWordBoundaries(const UnicodeString&);

const char RULES[] = "";

int main(int argc, char *argv[]) {
    listWordBoundaries(UnicodeString::fromUTF8("ປະເທດລາວ"));
}

void listWordBoundaries(const UnicodeString& s) {
    UParseError parse_error;
    UErrorCode status = U_ZERO_ERROR;
    RuleBasedBreakIterator* bi = new RuleBasedBreakIterator(
        UnicodeString::fromUTF8(RULES), parse_error, status
    );

    if(!U_SUCCESS(status)) {
        // u_errorName() returns the symbolic name of the UErrorCode
        fprintf(stderr, "Error creating RuleBasedBreakIterator: %s\n", u_errorName(status));
        if(U_MESSAGE_PARSE_ERROR == status) {
            fprintf(stderr, "Parse error on line %d offset %d\n", parse_error.line, parse_error.offset);
        }
        exit(1);
    }

    bi->setText(s);
    int32_t p = bi->first();
    while (p != BreakIterator::DONE) {
        printf("Boundary at position %d (status %d)\n", p, bi->getRuleStatus());
        p = bi->next();
    }
    delete bi;
}

However, I get a segmentation fault as soon as I call bi->next(). According to gdb, the cause is a NULL statetable:

Program received signal SIGSEGV, Segmentation fault.
icu_54::RuleBasedBreakIterator::handleNext (this=this@entry=0x614c70, statetable=0x0) at rbbi.cpp:1008
1008        UBool               lookAheadHardBreak = (statetable->fFlags & RBBI_LOOKAHEAD_HARD_BREAK) != 0;

The RULES string is supposed to hold the Lao.rbbi rules I linked to above. I have omitted it here because the effect is the same with an empty rule set. If I put some gibberish in the rules, the if(!U_SUCCESS(status)) check does work and the program exits with an error, so the rule parsing seems to work. However, even a status that passes U_SUCCESS doesn't seem to be enough to guarantee that the iterator is actually usable.

Any ideas what I'm missing here?

hippietrail
mbethke
  • I don't know what the Solr rules are or why they haven't considered contributing the work to ICU itself. Which ICU version are you using? ICU 53 and later have Lao rules, and they were improved in later versions. – Steven R. Loomis Jul 19 '16 at 21:32
  • The Solr version implements the breaking algorithm from [Syllabification of Lao Script for Line Breaking](http://www.panl10n.net/english/final%20reports/pdf%20files/Laos/LAO06.pdf). ICU doesn't have the concept of syllable breaking, does it? TBH I didn't even know Lao was supported, as Thai is mentioned in the docs but Lao isn't. I'm using 54.1 and it breaks words all right, but that's both too sophisticated and, as a dictionary-based solution, presumably too intolerant of neologisms or typos for my needs. – mbethke Jul 21 '16 at 22:19
  • OK. Yes, syllable isn't one of the types. Could be a good feature request. Anyway, I'm not sure what the root issue you're running into is. – Steven R. Loomis Jul 22 '16 at 19:05
