
I have UTF-8 strings such as those below:

            21st century 

      Other languages 

         General collections 

         Ancient languages 

         Medieval languages 

            Several authors (Two or more languages) 

As you can see, the strings contain alphanumeric characters as well as leading and trailing spaces.

I'd like to use PHP to count the leading spaces (not the trailing spaces) in each string. Note that the spaces might not be standard ASCII spaces. I tried using:

var_dump(mb_ord($space_char, "UTF-8"));

where $space_char contains a sample space character I copied from one of the above strings, and I got 160 (U+00A0, the no-break space) rather than 32.

I have tried:

strspn($string,$cmask); // $cmask contains a string with two space characters with 160 and 32 as their Unicode code points.

but the value I get back is unpredictable.
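
One possible explanation, offered as a sketch rather than a diagnosis: strspn() compares bytes, not characters. In UTF-8, U+00A0 is encoded as the two bytes 0xC2 0xA0, so each no-break space adds 2 to the result while each ASCII space adds 1. The sample string and the variable names below are made up for illustration, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Hypothetical sample: 3 no-break spaces (U+00A0), 1 ASCII space, then text.
$nbsp   = "\u{00A0}";                 // UTF-8 bytes: 0xC2 0xA0
$string = str_repeat($nbsp, 3) . " " . "21st century ";
$cmask  = $nbsp . " ";                // byte-wise, the mask is {0xC2, 0xA0, 0x20}

// strspn() counts the leading BYTES that occur in the mask: 3 * 2 + 1 = 7.
$byteCount = strspn($string, $cmask);

// To count characters instead, measure the matched byte prefix with mb_strlen().
$charCount = mb_strlen(substr($string, 0, $byteCount), 'UTF-8');

echo $byteCount, "\n"; // 7
echo $charCount, "\n"; // 4
```

Because the mask is treated as a set of bytes, this conversion can still go wrong when the first visible character happens to start with the byte 0xC2 (for example "©"), so a Unicode-aware regular expression is usually the safer route.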

The values should be:

(1) 12
(2) 6
(3) 9
(4) 9
(5) 9
(6) 12

What am I doing wrong?

Peter Mortensen
forgodsakehold

2 Answers


Code

mb_internal_encoding('UTF-8');

$s = "            21st century 
      Other languages 
         General collections 
         Ancient languages 
         Medieval languages 
            Several authors (Two or more languages) ";

// For every line (m flag), capture the run of whitespace that precedes the
// first non-whitespace character. With the u flag, \s is Unicode-aware in
// PHP, so it also matches U+00A0 (no-break space).
preg_match_all('/^(\s*)\S/mu', $s, $leading);
# print_r($leading);

// Print the length of each captured run in characters, not bytes.
function print_leading_len($v, $k)
{
    echo mb_strlen($v)."\n";
}
array_walk($leading[1], 'print_leading_len');

Output

12
6
9
9
9
12
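
As a hedged variant of the same idea, the sketch below splits the text into lines first and matches only horizontal whitespace (\h), so a run of leading whitespace can never extend across a blank line. The helper name leading_space_counts() and the sample text are hypothetical; this is not the approach used in the code above.

```php
<?php
// Hypothetical helper: returns the number of leading horizontal whitespace
// characters for every line of $text.
function leading_space_counts(string $text): array
{
    $counts = [];
    // \R matches any newline sequence, so this splits the text into lines.
    foreach (preg_split('/\R/u', $text) as $line) {
        // \h matches horizontal whitespace only (space, tab, U+00A0, ...).
        // The * quantifier lets the pattern match an empty run as well.
        preg_match('/^\h*/u', $line, $m);
        $counts[] = mb_strlen($m[0], 'UTF-8');
    }
    return $counts;
}

// Made-up sample: 3 no-break spaces, a blank line, then U+00A0 plus U+0020.
$text = "\u{00A0}\u{00A0}\u{00A0}Other languages \n"
      . "\n"
      . "\u{00A0} General collections ";

print_r(leading_space_counts($text)); // counts: 3, 0, 2
```

Unlike the pattern with \S, this version also reports 0 for empty lines instead of skipping them.
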
Peter Mortensen
qräbnö
  • An explanation (in the answer, not here in comments) would be in order. E.g., what is the regular expression supposed to do? Why is the "\S" part necessary? Can't it just match white space from the beginning of the line? What is array_walk() for? – Peter Mortensen Jan 31 '22 at 01:07

I would go the regular expression route:

<?php
function count_leading_spaces($str) {
    // \p{Zs} matches any character in the Unicode "Space Separator"
    // category: invisible, but takes up space (e.g. U+0020, U+00A0)
    if (mb_ereg('^\p{Zs}+', $str, $regs) === false)
        return 0;
    return mb_strlen($regs[0]);
}

$samples = [
'            21st century ',
'      Other languages ',
'         General collections ',
'         Ancient languages ',
'         Medieval languages ',
'            Several authors (Two or more languages) ',
];

foreach ($samples as $i => $sample) {
    printf("(%d) %d\n", $i + 1, count_leading_spaces($sample));
}

Output:

(1) 12
(2) 6
(3) 9
(4) 9
(5) 9
(6) 12
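
To make the scope of \p{Zs} more concrete, here is a small sketch that uses preg_match() with the u flag instead of mb_ereg(). The function name count_leading_zs() is hypothetical and the inputs are made up; it only illustrates which characters the Zs category does and does not cover.

```php
<?php
// Hypothetical PCRE-based variant: counts leading "Space Separator" characters.
function count_leading_zs(string $str): int
{
    // The u flag makes PCRE treat the pattern and the subject as UTF-8.
    if (preg_match('/^\p{Zs}+/u', $str, $m) !== 1) {
        return 0;
    }
    return mb_strlen($m[0], 'UTF-8');
}

echo count_leading_zs("\u{00A0}\u{00A0} text"), "\n"; // 3: two U+00A0 plus one U+0020
echo count_leading_zs("\u{3000}text"), "\n";          // 1: IDEOGRAPHIC SPACE is in Zs
echo count_leading_zs("\ttext"), "\n";                // 0: TAB is a control character (Cc)
echo count_leading_zs("\u{200B}text"), "\n";          // 0: ZERO WIDTH SPACE is a format character (Cf)
```

If tabs should count as leading whitespace too, \h or \s would be a broader choice than \p{Zs}.
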
Sean Bright
  • What are some examples of whitespace characters that are invisible? [ZERO WIDTH SPACE](https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128) (U+200B)? – Peter Mortensen Jan 31 '22 at 00:57
  • @PeterMortensen https://stackoverflow.com/questions/37534155/in-the-c-sharp-language-specification-what-is-meant-by-any-character-with-unico – Sean Bright Jan 31 '22 at 12:24