7

I ran an experiment out of curiosity to see whether there is any measurable difference between strtolower() and strtoupper(). I expected strtolower() to be faster on mostly lowercase strings and vice versa. What I found is that strtolower() was slower in all cases (although the difference is insignificant until you do it millions of times). This was my test.

$string = 'hello world';
$start_time = microtime();
for ($i = 0; $i < 10000000; $i++) {
    strtolower($string);
}
$timed = microtime() - $start_time;
echo 'strtolower ' . $string . ' - ' . $timed . '<br>';

I repeated this for strtolower() and strtoupper() with hello world, HELLO WORLD, and Hello World. Here is the full gist. I've run the code several times and keep getting roughly the same results. Here's one run of the test. (These numbers come from the original source, which used $i < $max = 1000000 as in the gist, so there is potentially extra overhead in the loop; see comments.)

strtolower hello world - 0.043829
strtoupper hello world - 0.04062
strtolower HELLO WORLD - 0.042691
strtoupper HELLO WORLD - 0.015475
strtolower Hello World - 0.033626
strtoupper Hello World - 0.017022

I believe the C code in the php-src GitHub repository that controls this is here for strtolower() and here for strtoupper().

To be clear, this isn't going to prevent me from ever using strtolower(). I am only trying to understand what is going on here.

Why is strtolower() slower than strtoupper()?

Peter Cordes
Goose
  • 6
    Why is this tagged C? – Jabberwocky Jun 27 '17 at 13:53
  • 5
    @MichaelWalz Because PHP is written in C? – Andreas Jun 27 '17 at 13:54
  • Why are you wasting time with the assignment to `$max`? Are you wasting time in both versions? – pmg Jun 27 '17 at 13:54
  • @Andreas Nope, PHP itself is not written in C, [the PHP interpreter itself is written in C](https://softwareengineering.stackexchange.com/a/176512). – TripleDeal Jun 27 '17 at 13:58
  • @pmg the assignment to $max is just to make the code more clear. The code is tagged C because the behavior is likely caused by the source code in C. – Goose Jun 27 '17 at 13:58
  • 1
    I am curious why people are even bothering to test these things? – Eugene Sh. Jun 27 '17 at 13:58
  • 2
    @EugeneSh. I wanted to test my assumption that the difference was insignificant, which it was, but I got intrigued by what I found to be an unexpected result, so I thought I'd ask. – Goose Jun 27 '17 at 13:59
  • 1
    How is `$i < $max = 1000000` clearer than `$i < 1000000`? – pmg Jun 27 '17 at 13:59
  • 2
    @TripleDeal Isn't it kind of obvious that was what I meant? – Andreas Jun 27 '17 at 14:00
  • @pmg fair enough, removed. – Goose Jun 27 '17 at 14:00
  • @Andreas Sorry, I wasn't sure. – TripleDeal Jun 27 '17 at 14:00
  • Why not take the sources, compile and look at the disassembly? – Eugene Sh. Jun 27 '17 at 14:01
  • For information, that's probably the tolower/toupper functions referenced in the PHP source code: https://github.com/torvalds/linux/blob/a379f71a30dddbd2e7393624e455ce53c87965d1/include/linux/ctype.h It doesn't really explain anything though since the functions and macros they call seem to be equivalent, so it could be some weird compiler optimisation that happens in one case and not another. – laurent Jun 27 '17 at 14:13
  • It's tagged C because c is UPPERCASE. Or lowercase. Or something. – headbanger Jun 27 '17 at 14:24
  • The edit that changed `$i < $max = 1000000` to `$i < 1000000` (per pmg's suggestion) didn't update the benchmark numbers. I'd guess that any extra interpreter overhead from that would just be additive with the work of allocating/copying and case-converting strings, but you don't know for sure without testing. And for [mcve] reasons, it's always bad to have benchmark numbers that aren't from the code shown. 5 years later you probably don't have the same software versions (or maybe even CPU) to reproduce this, so probably best to show the code that matches the numbers. – Peter Cordes Aug 04 '22 at 20:38
  • Why did you accept the answer below given it's deliberately incorrect? – Your Common Sense Aug 05 '22 at 07:17
  • you need to change microtime() to microtime( true ) otherwise the results may come out to be incorrect – Aurangzeb Aug 05 '22 at 12:34

3 Answers

4

It mostly depends on which character encoding you are currently using, but the main cause of the speed difference is the size of the encoded form of special characters.

Taken from babelstone.co.uk:

For example, lowercase j with caron (ǰ) is represented as a single encoded character (U+01F0 LATIN SMALL LETTER J WITH CARON), but the corresponding uppercase character (J̌) is represented in Unicode as a sequence of two encoded characters (U+004A LATIN CAPITAL LETTER J + U+030C COMBINING CARON).

More data to sift through in the index of Unicode characters will inevitably take a little longer.

Keep in mind that strtolower uses your current locale, so if your server is using a character encoding that does not support strtolower of special characters (such as 'Ê'), it will simply return the special character unchanged. The character mapping for UTF-8 is set up, however, which can be confirmed by running mb_strtolower.

There is also the possibility of comparing the number of characters that fall into the category of uppercase vs the amount you will find in the lowercase category, but once again, that is dependent on your character encoding.

In short, strtolower has a bigger database of characters to compare each individual string character to when it checks whether or not the character is uppercase.

Frits
  • 1
    I don't think strtolower/upper do this kind of unicode encoding, it just works with plain ASCII. I've just tried `echo strtoupper('ç');` and got `ç`. Maybe the mb_* function would do this, but not the regular ones. – laurent Jun 27 '17 at 14:12
  • 1
    @this.lau_ thanks, from what I can see, if your server locale uses UTF-8 already, the conversion should happen in any case. If I understand the reasoning correctly, the mb_strtolower function would be used to force the UTF-8 character mapping. I've adjusted my answer to show this. – Frits Jun 27 '17 at 14:48
  • @YourCommonSense there's no need to be rude. The whole sentence you are referring to says " **If I understand the reasoning correctly**, the mb_strtolower function would be used to force the UTF-8 character mapping". Please try to be more constructive in your feedback. We're all here to learn, and I'd much rather be wrong and learn than live in ignorance. Your comment however achieves neither. – Frits Aug 15 '22 at 09:19
4

There are a couple of very slight differences in the implementation of the code:

PHPAPI char *php_strtoupper(char *s, size_t len)
{
    unsigned char *c, *e;

    c = (unsigned char *)s;
    e = (unsigned char *)c+len;    /* strtolower uses e = c+len; */

    while (c < e) {
        *c = toupper(*c);
        c++;
    }
    return s;
}

PHPAPI zend_string *php_string_toupper(zend_string *s)
{
    unsigned char *c, *e;

    c = (unsigned char *)ZSTR_VAL(s);
    e = c + ZSTR_LEN(s);

    while (c < e) {
        if (islower(*c)) {
            register unsigned char *r;
            zend_string *res = zend_string_alloc(ZSTR_LEN(s), 0);

            if (c != (unsigned char*)ZSTR_VAL(s)) {
                memcpy(ZSTR_VAL(res), ZSTR_VAL(s), c - (unsigned char*)ZSTR_VAL(s));
            }
            r = c + (ZSTR_VAL(res) - ZSTR_VAL(s));
            while (c < e) {
                *r = toupper(*c);
                r++;
                c++;
            }
            *r = '\0';
            return res;
        }
        c++;
    }
    return zend_string_copy(s);
}

PHP_FUNCTION(strtoupper)
{
    zend_string *arg;      /* strtolower uses zend_string *str; */

    ZEND_PARSE_PARAMETERS_START(1, 1)
        Z_PARAM_STR(arg)          /* strtolower uses Z_PARAM_STR(str) */
    ZEND_PARSE_PARAMETERS_END();

    RETURN_STR(php_string_toupper(arg));     /* strtolower uses RETURN_STR(php_string_tolower(str)); */
}

and for strtolower

PHPAPI char *php_strtolower(char *s, size_t len)
{
    unsigned char *c, *e;

    c = (unsigned char *)s;
    e = c+len;                  /* strtoupper uses e = (unsigned char *)c+len; */

    while (c < e) {
        *c = tolower(*c);
        c++;
    }
    return s;
}

PHPAPI zend_string *php_string_tolower(zend_string *s)
{
    unsigned char *c, *e;

    c = (unsigned char *)ZSTR_VAL(s);
    e = c + ZSTR_LEN(s);

    while (c < e) {
        if (isupper(*c)) {
            register unsigned char *r;
            zend_string *res = zend_string_alloc(ZSTR_LEN(s), 0);

            if (c != (unsigned char*)ZSTR_VAL(s)) {
                memcpy(ZSTR_VAL(res), ZSTR_VAL(s), c - (unsigned char*)ZSTR_VAL(s));
            }
            r = c + (ZSTR_VAL(res) - ZSTR_VAL(s));
            while (c < e) {
                *r = tolower(*c);
                r++;
                c++;
            }
            *r = '\0';
            return res;
        }
        c++;
    }
    return zend_string_copy(s);
}

PHP_FUNCTION(strtolower)
{
    zend_string *str;     /* strtoupper uses zend_string *arg; */

    ZEND_PARSE_PARAMETERS_START(1, 1)
        Z_PARAM_STR(str)        /* strtoupper uses Z_PARAM_STR(arg) */
    ZEND_PARSE_PARAMETERS_END();

    RETURN_STR(php_string_tolower(str));    /* strtoupper uses RETURN_STR(php_string_toupper(arg)); */
}

Whether these minor differences are enough to affect performance by those few nanoseconds, I don't know. I'm also unsure why the differences are even there.
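For readers who don't know C, the control flow of php_string_toupper() shown above can be tried outside php-src with a rough standalone sketch. This is not the PHP implementation: `upper_copy_on_change` is a hypothetical name, plain C strings and malloc() stand in for zend_string and zend_string_alloc(), and returning the input pointer stands in for zend_string_copy(). It scans until the first byte that needs converting, and only then allocates, copies the untouched prefix, and converts the rest.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical standalone analogue of php_string_toupper().
 * Assumptions: plain char buffers and malloc() replace zend_string and
 * zend_string_alloc(); returning `s` replaces zend_string_copy(s).
 * Returns `s` itself when nothing needs converting, otherwise a new
 * NUL-terminated buffer that the caller must free().
 */
char *upper_copy_on_change(char *s, size_t len)
{
    unsigned char *c, *e;

    assert(s != NULL);
    c = (unsigned char *)s;
    e = c + len;

    while (c < e) {
        if (islower(*c)) {
            /* First lowercase byte found: allocate and copy the prefix. */
            size_t prefix = (size_t)(c - (unsigned char *)s);
            char *res = malloc(len + 1);
            unsigned char *r;

            if (res == NULL) {
                return NULL;
            }
            memcpy(res, s, prefix);
            r = (unsigned char *)res + prefix;
            /* Convert the remainder without re-checking islower(). */
            while (c < e) {
                *r++ = (unsigned char)toupper(*c++);
            }
            *r = '\0';
            return res;
        }
        c++;
    }
    return s; /* nothing to convert: no allocation, no copy */
}
```

Calling it on an already-uppercase buffer returns the same pointer, while a mixed-case buffer yields a freshly allocated copy. So in this sketch the cost depends on where the first convertible byte sits, not only on which of tolower()/toupper() is called.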

Mark Baker
  • That's probably it - minor implementation differences that show a small difference over millions of operations. – laurent Jun 27 '17 at 14:19
  • 1
    One difference is how a variable is named, while the other is the inclusion of a redundant cast. I highly doubt these make any difference. – dbush Jun 27 '17 at 14:39
  • I find this answer insufficient. I had already noticed these differences when looking at the source code I linked, but did not understand these differences because I do not understand the C language. This answer hasn't shed much light for me. Perhaps somebody can explain why the differences exist and what effect they might have. – Goose Jun 27 '17 at 14:42
  • 2
    Indeed, `arg` vs `str` is just a variable name and can't possibly make a difference. The cast thing ... I'm not so sure about, **but** PHP 7 doesn't use it - `php_strtolower()`, `php_strtoupper()` are legacy internal functions from the PHP 5 sources (some extensions use them). If there's any real difference (it's a naive benchmark that we have), it would come from C's own `islower()`, `isupper()` and `tolower()`, `toupper()` functions. – Narf Jun 27 '17 at 14:45
0

Your gist has a problem: you're using microtime() without an argument, which gives this result:

strtolower hello world - 0.291195
strtoupper hello world - 0.42024
strtolower HELLO WORLD - -0.616288
strtoupper HELLO WORLD - 0.297302
strtolower Hello World - 0.384197
strtoupper Hello World - -0.558648

You should use microtime(true), which returns the time as a float number of seconds instead of a "msec sec" string, so here are the results using microtime(true):

strtolower hello world - 0.33714199066162
strtoupper hello world - 0.42499804496765
strtolower HELLO WORLD - 0.38841104507446
strtoupper HELLO WORLD - 0.3044650554657
strtolower Hello World - 0.40368413925171
strtoupper Hello World - 0.444669008255

So the problem seems to be in the benchmark code rather than in the functions themselves.
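As a rough illustration of where the negative timings come from: without an argument, microtime() returns a string of the form "msec sec", and subtracting two such strings only uses the leading number of each, i.e. the fractional part of the second. The C sketch below mimics that arithmetic. It is an analogy, not PHP's conversion code: strtod() is assumed as a stand-in for the leading-number conversion, the function names are hypothetical, and the sample timestamps are made up.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse only the leading number of a "msec sec" string; strtod() stops at
 * the space. Assumed stand-in for how the subtraction treats the string. */
double leading_number(const char *s)
{
    char *end;
    double v = strtod(s, &end);

    assert(end != s); /* input must start with a number */
    return v;
}

/* Parse both fields and combine them, like a float timestamp in seconds. */
double full_seconds(const char *s)
{
    char *rest;
    double frac = strtod(s, &rest);

    assert(rest != s);
    return strtod(rest, NULL) + frac; /* strtod() skips the leading space */
}

/* Mimics `microtime() - $start_time`: fractional parts only. */
double naive_elapsed(const char *start, const char *end)
{
    return leading_number(end) - leading_number(start);
}

/* Mimics the same subtraction on full float timestamps. */
double full_elapsed(const char *start, const char *end)
{
    return full_seconds(end) - full_seconds(start);
}
```

With a made-up start of "0.90000000 1498571234" and end of "0.20000000 1498571235", naive_elapsed() comes out negative because the seconds field rolled over in between, while full_elapsed() gives roughly 0.3 seconds.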

Aurangzeb