
I'm new here and this is my first question. I'm trying to determine which letter most frequently follows a given character in a given text. To do this, I'm creating arrays named after the processed letters. PHP Version 5.5.9-1ubuntu4.17, default_charset UTF-8. Before the sample text goes into the array it displays correctly, but when read back from the array it shows �. What have I done wrong? ... These are my first steps in PHP, so please be forgiving.

<!DOCTYPE html>
<html>
    <head>
    <meta charset="UTF-8">
    </head>
<body>
    
test HTML<br />
    <?php
    echo "test PHP";
    echo "<br />(...)<br />";
    $line=('abadańasakąaazaąaśaćałóańąaęażaź→↓←aß©ęœaπąśðæŋ’ə…ł≤µń”„aćaźż');          
    $strlen = strlen($line);  
    for ($i=0; $i<$strlen; $i++) 
    {
        $char = substr($line, $i, 1);
        $char2 = substr($line, $i+1, 1);
        echo $char;  // all characters looks fine here
        
        if (empty(${'sign_'.$char}[$char2])) 
        {                        
            ${'sign_'.$char}[$char2]=1;            
        }
        else
        {    
            ${'sign_'.$char}[$char2]++; 
        } 
        
    }
          
    echo "<br />(...)<br />";
    arsort ($sign_a);   
    echo "most frequent character after letter / a / is: <br />";
    foreach ($sign_a as $key => $val) 
    {
        echo " ... $key=$val "; // and here we have mess
    }
    echo "<br />(...)<br />"; 
    echo phpinfo();
   
    ?>

</body>
</html>
Fryziu DeMol

2 Answers


You need to use the multibyte string functions, because strlen() and substr() work on bytes and cut multi-byte UTF-8 characters apart. Like so:

echo "test PHP";
echo "<br />(...)<br />";
$line=('abadańasakąaazaąaśaćałóańąaęażaź→↓←aß©ęœaπąśðæŋ’ə…ł≤µń”„aćaźż');
$strlen = mb_strlen($line);
for ($i=0; $i<$strlen; $i++)
{
    $char = mb_substr($line, $i, 1);
    $char2 = mb_substr($line, $i+1, 1);
    echo $char;  // each character is now extracted whole

    if (empty(${'sign_'.$char}[$char2]))
    {
        ${'sign_'.$char}[$char2]=1;
    }
    else
    {
        ${'sign_'.$char}[$char2]++;
    }

}

echo "<br />(...)<br />";
arsort ($sign_a);
echo "most frequent character after letter / a / is: <br />";
foreach ($sign_a as $key => $val)
{
    echo " ... $key=$val "; // keys are now whole characters, no more mess
}
echo "<br />(...)<br />";

Note the mb_strlen() and mb_substr() functions. See: http://php.net/manual/en/ref.mbstring.php

My output is now: ... ń=2 ... ć=2 ... ź=2 ... b=1 ... d=1 ... s=1 ... k=1 ... a=1 ... z=1 ... ą=1 ... ś=1 ... ł=1 ... ę=1 ... ż=1 ... ß=1 ... π=1
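
To see why the byte-oriented functions produce the � character, here is a minimal sketch that compares both families on a single two-byte character. It assumes the file is saved as UTF-8 and that the mbstring extension is loaded. The encoding is passed explicitly so the result does not depend on the configured mbstring internal encoding.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$s = 'ań';   // 'a' is 1 byte, 'ń' is 2 bytes in UTF-8

// Byte-oriented functions count and cut bytes.
$byteLength = strlen($s);                    // 3
$brokenChar = substr($s, 1, 1);              // only the first byte of 'ń'

// Multibyte functions count and cut characters.
$charLength = mb_strlen($s, 'UTF-8');        // 2
$wholeChar  = mb_substr($s, 1, 1, 'UTF-8');  // 'ń'

echo "$byteLength bytes, $charLength characters\n";
var_dump(mb_check_encoding($brokenChar, 'UTF-8')); // half a character is not valid UTF-8
var_dump($wholeChar === 'ń');
```

The half character returned by substr() is what the browser renders as �, both when echoed alone and when used as an array key.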

KIKO Software
  • Thank you both. I was introduced to the Zend Multibyte extension and the PHP configuration files for UTF-8 encoding, which got my code working. The idea of multidimensional arrays switched my brain on as well :) – Fryziu DeMol Jun 26 '16 at 10:34

Besides using the multibyte mb_* functions, you should avoid creating variable names at runtime from (potentially suspicious) input.

Instead, simply use a two-dimensional array that stores the letter counts as a matrix
$chars[letter][trailingLetter].

Example:

  | a | b | c | < letter
a | - | - | - |
b | 1 | - | 1 | 
c | - | 2 | - |
^
trailingLetter

for $line = 'abcbc'. You can then read $chars['a'] to get the same array as your $sign_a array. Here is some code:

$line=('abadańasakąaazaąaśaćałóańąaęażaź→↓←aß©ęœaπąśðæŋ’ə…ł≤µń”„aćaźż');
$lineLength = mb_strlen($line);
$chars = array();

for ($i = 0; $i < $lineLength - 1; $i++) {
    $char = mb_substr($line, $i, 1);
    $char2 = mb_substr($line, $i+1, 1);

    if (!isset($chars[$char])) {        // create the second dimension for $char
        $chars[$char] = array();
    }
    if (empty($chars[$char][$char2])) { // no hit for $char2 after $char yet
        $chars[$char][$char2] = 1;
    } else {                            // another hit for $char2 after $char
        $chars[$char][$char2]++;
    }
}

$interestingChar = 'a';
$trailingChars = $chars[$interestingChar];
arsort ($trailingChars);

echo "most frequent character after letter / $interestingChar / is: <br />";
foreach ($trailingChars as $char => $count)
{
    echo " ... '$char'=$count ";
}
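
The same counting can also be wrapped in a function, so that nothing depends on global variable names. This is a minimal sketch and not part of the code above: the function name followerCounts() is hypothetical, and it splits the string with preg_split() and the /u modifier instead of calling mb_substr() inside the loop. It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: for each character, count which characters follow it.
// Assumes $text is valid UTF-8.
function followerCounts($text)
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so the empty pattern splits between characters, not bytes.
    $letters = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $counts = array();

    // Stop one short: the last character has no successor.
    for ($i = 0; $i < count($letters) - 1; $i++) {
        $char = $letters[$i];
        $next = $letters[$i + 1];
        if (!isset($counts[$char][$next])) {
            $counts[$char][$next] = 0;
        }
        $counts[$char][$next]++;
    }
    return $counts;
}

$counts = followerCounts('abcbc');
print_r($counts['b']);   // the followers of 'b'
```

For 'abcbc' this yields one 'b' after 'a', two 'c' after 'b' and one 'b' after 'c', which matches the matrix shown earlier.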
DEAD10CC