Undefined offsets and diacritical marks

Question

I'm trying to parse Laotian text with utf8_ireplace and I'm getting an undefined offset notice.

The one thing I can see is that there are diacritical marks. Would that cause that warning? Or can someone give me a clue of why it would always be Laotian (of 6 languages I'm processing)?

Is there a special way that Laotian and similar languages (such as Tibetan) should be handled with regard to utf8_ireplace? Is it a known issue that it raises notices with some characters in those languages? Are diacriticals the issue, or is it something else? Does anyone know how to avoid the notices other than by turning off notice reporting?

Update: Actually, let me add that Laotian has no spaces between words, so you have to separate the strings of characters, and that's what I am using utf8_ireplace for. It fails for Laotian even though it seems to work for Thai, for example. So really I'm trying to break up strings of characters, but for some reason the offsets are undefined. Tibetan also seems to have problems, e.g. "ས".

Update

Here is the central question: Why do I get notices when using utf8_ireplace on some words in Laotian?

(Joomla)

// Iterate through the terms and test if they contain the relevant characters.
for ($i = 0, $n = count($terms); $i < $n; $i++)
{
    $charMatches = array();
    if ($lang === 'zh')
    {
        $charCount = preg_match_all('#[\x{4E00}-\x{9FCF}]#mui', $terms[$i], $charMatches);
    }

    elseif ($lang === 'ja')
    {
        // Kanji (Han), Katakana and Hiragana are each checked
        $charCount = preg_match_all('#[\x{4E00}-\x{9FCF}]#mui', $terms[$i], $charMatches);
        $charCount += preg_match_all('#[\x{3040}-\x{309F}]#mui', $terms[$i], $charMatches);
        $charCount += preg_match_all('#[\x{30A0}-\x{30FF}]#mui', $terms[$i], $charMatches);
    }
    elseif ($lang === 'th')
    {
        $charCount = preg_match_all('#[\x{0E00}-\x{0E7F}]#mui', $terms[$i], $charMatches);
    }
    elseif ($lang === 'km')
    {
        $charCount = preg_match_all('#[\x{1780}-\x{17FF}]#mui', $terms[$i], $charMatches);
    }
    elseif ($lang === 'lo')
    {
        $charCount = preg_match_all('#[\x{0E80}-\x{30EFF}]#mui', $terms[$i], $charMatches);
    }
    elseif ($lang === 'my')
    {
        $charCount = preg_match_all('#[\x{1000}-\x{109F}]#mui', $terms[$i], $charMatches);
    }
    elseif ($lang === 'bo')
    {
        $charCount = preg_match_all('#[\x{0F00}-\x{0FFF}]#mui', $terms[$i], $charMatches);
    }
    // Split apart any groups of characters.
    for ($j = 0; $j < $charCount; $j++)
    {
        if (isset($charMatches[0][$j]))
        {
            $tSplit = JString::str_ireplace($charMatches[0][$j], '', $terms[$i], null);

            if (!empty($tSplit))
            {
                $terms[$i] = $tSplit;
            }
            else
            {
                unset($terms[$i]);
            }

            $terms[] = $charMatches[0][$j];
        }
    }
}

// Reset array keys.
$terms = array_values($terms);

I don't actually have any my or bo (Myanmar or Tibetan) sample data but I do have Thai, Japanese, Chinese traditional, and Laotian. — Elin, Mar 31 '13 at 01:30
I don't understand why this is not a real question. I want to know if there is an issue with dealing with diacriticals or something else going on with handling Laotian. I don't know how that's not a real question. I'll try to rephrase — Elin, Apr 01 '13 at 02:11
Can you please provide some input data to test what's going on? Even for your own unit tests, this will help. So maybe wrap the code into a function `splitWords($lang, array $terms)` and provide the input. Maybe it is a bug with your PHP version? - Try the code at http://3v4l.org/. — Shi, Jun 09 '14 at 12:31
Do you have *internal encoding* configured properly and also are your input data really UTF-8 encoded? — David Ferenczy Rogožan, Jul 30 '14 at 16:11
Yes and yes. I do have data but obviously I can't paste it in if you want to be able to see what Tibetan looks like. http://joomlacode.org/gf/download/trackeritem/27511/79412/lexerdata.sql should start a download. — Elin, Jul 30 '14 at 21:17

score 0 · Accepted Answer · answered Sep 22 '14 at 11:43

I think the offset error could refer to the regex used in preg_match_all. I've tested the regex for 'lo' using regex101.com and it returns this error:

\x{30EFF} Character offset is too large. Reduce it to 4 hexadecimal characters or enable UTF-16 (u-modifier)

The other regexes tested just fine.
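
For reference, the Lao Unicode block is U+0E80 to U+0EFF, so the upper bound in the 'lo' branch was presumably meant to be \x{0EFF} rather than \x{30EFF}. Below is a minimal sketch of that check, assuming PHP 7.0+ (for the \u{...} string escapes). countLaoChars is a hypothetical helper name used only for illustration; it is not part of Joomla or JString.

```php
<?php
// Hypothetical helper (not a Joomla API): counts code points in the Lao block.
function countLaoChars(string $term): int
{
    // Lao occupies U+0E80..U+0EFF. The 'u' modifier makes PCRE treat both
    // the pattern and the subject as UTF-8, which \x{...} escapes above
    // 0xFF require.
    $count = preg_match_all('#[\x{0E80}-\x{0EFF}]#u', $term, $matches);

    // preg_match_all() returns false on failure, e.g. malformed UTF-8 input.
    return $count === false ? 0 : $count;
}

// Sample strings written with \u{...} escapes so the file encoding
// cannot interfere with the test.
$lao = "\u{0EAA}\u{0EB0}\u{0E9A}\u{0EB2}\u{0E8D}\u{0E94}\u{0EB5}"; // seven Lao code points
$han = "\u{4E2D}\u{6587}";                                         // two Han code points

var_dump(countLaoChars($lao)); // int(7)
var_dump(countLaoChars($han)); // int(0)
```

With the narrower range, Han characters are no longer counted as Lao. Whether this alone removes the notice in the splitting loop is not something this sketch confirms.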
