What is the best (cheapest) way to CamelCase complex input strings?

Question

I have a large number of real-time incoming phrases which need to be transformed to alpha only - CamelCase by word and split point.

This is what I came up with so far, but is there a cheaper and faster way to perform that task?

function FoxJourneyLikeACamelsHump(string $string): string {
  $string = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $string);
  $string = ucwords($string);
  $camelCase = preg_replace('/\s+/', '', $string);
  return $camelCase;
}

// $expected = "ThQuCkBrWnFXJumpsVRThLZyDG";
$string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
$is = FoxJourneyLikeACamelsHump($string);

Results:

Sentences: 100000000
Total time: 40.844197034836 seconds
average: 0.000000408

Fair enough, not my intention to irritate anybody. I thought I'd bring attention to the obstacles I'm facing without anybody needing to read too much, but will keep it in mind next time. — mkungla, Apr 08 '17 at 03:20
You do 0.2 billion regex based replaces in about 41 seconds - that's not good enough? — Robin Mackenzie, Apr 08 '17 at 05:56
You don't say why the current performance is an issue: you might need to frame the situation a bit more to contextualise it. We might be looking at the wrong part of the problem. As @RobinMackenzie alludes to... this *might* be a case of premature optimisation to me. Do you actually have a business-related problem you're trying to solve? ie: "this thing is taking too long, and we're losing money as a result". That's when one might need to start micro-optimising. Not saying you *don't* have a legit case; but yer not explaining it. — Adam Cameron, Apr 08 '17 at 06:21
Since you want to deal with unicode strings, you can't use functions like `ucwords` or `ucfirst` that are not unicode aware. — Casimir et Hippolyte, Apr 08 '17 at 10:57

trincot · Accepted Answer · 2017-04-08T12:50:53.683

Your code is quite efficient. You can still improve with a few tweaks:

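Provide the delimiter to ucwords so it does not have to look for \t, \n, etc. Those will not be in your string after the first step as long as the input contains no whitespace other than plain spaces (the first regex keeps every [:space:] character, so tabs or newlines would survive it). On average this gives 1% improvement;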
You can perform the last step with a non-regex replace on a space. This gives up to 20% improvement.

Code:

function FoxJourneyLikeACamelsHump(string $string): string {
    $string = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $string);
    $string = ucwords($string, ' ');
    $camelCase = str_replace(' ', '', $string);
    return $camelCase;
}

See the timings for the original and improved version on rextester.com.
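One caveat that may be worth checking before adopting the tweaks: the first regex keeps every [:space:] character, so the two versions only agree when the input's whitespace consists of plain spaces. The sketch below compares them on the sample sentence and on a string containing a tab and a newline. The function names and the third variant (camelTweakedSafe, which collapses every non-letter run into one space first) are hypothetical, added only for illustration, and are not part of the benchmarked code.

```php
<?php
// Original from the question: default ucwords delimiters, regex whitespace strip.
function camelOriginal(string $s): string {
    $s = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $s);
    $s = ucwords($s);
    return preg_replace('/\s+/', '', $s);
}

// Tweaked version from this answer: space-only delimiter, str_replace.
function camelTweaked(string $s): string {
    $s = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $s);
    $s = ucwords($s, ' ');
    return str_replace(' ', '', $s);
}

// Hypothetical variant: turn every run of non-letters (whitespace included)
// into a single space, so only ' ' can remain as a separator.
function camelTweakedSafe(string $s): string {
    $s = preg_replace('/[^[:alpha:]]+/u', ' ', $s);
    $s = ucwords($s, ' ');
    return str_replace(' ', '', $s);
}

$plain  = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
$tabbed = "quick\tbrown\nfox";

// Same result on the sample sentence.
echo camelOriginal($plain), "\n";
echo camelTweaked($plain), "\n";

// Different results once tabs or newlines appear.
echo json_encode(camelOriginal($tabbed)), "\n";
echo json_encode(camelTweaked($tabbed)), "\n";
echo json_encode(camelTweakedSafe($tabbed)), "\n";
```

If your incoming phrases are guaranteed to contain only plain spaces, the tweaked version is fine as it stands. The third variant has not been timed here, so its cost relative to the other two is unknown.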

Note: As you used ucwords, your code cannot be used reliably for unicode strings in general. To cover for that you would need to use a function like mb_convert_case:

$string = mb_convert_case($string,  MB_CASE_TITLE);

... but this has a performance impact.
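As a rough sketch of what that swap could look like, here is the function with mb_convert_case in place of ucwords. The name camelCaseMb is hypothetical and the encoding is assumed to be UTF-8. Note that MB_CASE_TITLE also lowercases the rest of each word, which ucwords does not do, so the output can differ for input with capitals in the middle of a word.

```php
<?php
// Hypothetical multibyte-aware variant; assumes UTF-8 input and the mbstring extension.
function camelCaseMb(string $string): string {
    // Replace everything that is neither a letter nor whitespace with a space.
    $string = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $string);
    // Title-case each word; unlike ucwords this understands multibyte letters,
    // but it also lowercases the remaining letters of every word.
    $string = mb_convert_case($string, MB_CASE_TITLE, 'UTF-8');
    return str_replace(' ', '', $string);
}

echo camelCaseMb('liberté égalité fraternité'), "\n"; // LibertéÉgalitéFraternité
echo camelCaseMb('hELLO wORLD'), "\n";                // HelloWorld
```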

score 2 · Answer 2 · edited May 23 '17 at 12:25

Benchmarked against 3 alternatives, I believe your method is the fastest. Here are the results from 100,000 iterations:

array(4) {
  ["Test1"]=>
  float(0.23144102096558)
  ["Test2"]=>
  float(0.41140103340149)
  ["Test3"]=>
  float(0.31215810775757)
  ["Test4"]=>
  float(0.98423790931702)
}

Where Test1 is yours, Test2 and Test3 are mine, and Test4 is from @RizwanMTuman's answer (with a fix).

I thought using preg_split might give you an opportunity to optimise. This function uses only one regex, which returns an array containing just the alpha items; ucfirst is then applied to each:

function FoxJourneyLikeACamelsHump_2(string $string): string {
    return implode('', array_map(function($word) {
        return ucfirst($word);
    }, preg_split("/[^[:alpha:]]/", $string, -1, PREG_SPLIT_NO_EMPTY)));
}

This can be further optimised by using foreach instead of array_map (see here):

function FoxJourneyLikeACamelsHump_3(string $string): string {
    $validItems = preg_split("/[^[:alpha:]]/u", $string, -1, PREG_SPLIT_NO_EMPTY);
    $result = '';
    foreach($validItems as $item) {
        $result .= ucfirst($item);
    }
    return $result;
}

This leads me to speculate that 2 regexes and 1 ucwords is faster than 1 regex and multiple ucfirsts.

Full test script:

<?php

// yours
function FoxJourneyLikeACamelsHump_1(string $string): string {
  $string = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $string);
  $string = ucwords($string);
  $camelCase = preg_replace('/\s+/', '', $string);
  return $camelCase;
}

// mine v1
function FoxJourneyLikeACamelsHump_2(string $string): string {
    return implode('', array_map(function($word) {
        return ucfirst($word);
    }, preg_split("/[^[:alpha:]]/", $string, -1, PREG_SPLIT_NO_EMPTY)));
}

// mine v2
function FoxJourneyLikeACamelsHump_3(string $string): string {
    $validItems = preg_split("/[^[:alpha:]]/u", $string, -1, PREG_SPLIT_NO_EMPTY);
    $result = '';
    foreach($validItems as $item) {
        $result .= ucfirst($item);
    }
    return $result;
}

// Rizwan with a fix
function FoxJourneyLikeACamelsHump_4(string $string): string {
    $re = '/(?:\b|\d+)([a-z])|[\d+ +!.@]/';
    $result = preg_replace_callback($re,function ($matches) {
        return (isset($matches[1]) ? strtoupper($matches[1]) : '');
    },$string);
    return $result;
}


// $expected = "ThQuCkBrWnFXJumpsVRThLZyDG";
$test1 = 0;
$test2 = 0;
$test3 = 0;
$test4 = 0;

$loops = 100000;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_1($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test1 = $time_end - $time_start;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_2($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test2 = $time_end - $time_start;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_3($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test3 = $time_end - $time_start;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_4($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test4 = $time_end - $time_start;

var_dump(array('Test1'=>$test1, 'Test2'=>$test2, 'Test3'=>$test3, 'Test4'=>$test4));
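The four timing loops in the script are identical apart from the function being called, so they could be folded into one helper. A minimal sketch, assuming a hypothetical helper named timeCandidate and using two stand-in closures instead of the full set of functions from the script:

```php
<?php
// Hypothetical helper: call $fn on $input $loops times and return elapsed seconds.
function timeCandidate(callable $fn, string $input, int $loops): float {
    $start = microtime(true);
    for ($i = 0; $i < $loops; $i++) {
        $fn($input);
    }
    return microtime(true) - $start;
}

// Stand-ins for two of the candidates; the real script could pass the
// function names ('FoxJourneyLikeACamelsHump_1', ...) as callables instead.
$candidates = [
    'Test1' => function (string $s): string {
        $s = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $s);
        return preg_replace('/\s+/', '', ucwords($s));
    },
    'Test3' => function (string $s): string {
        $result = '';
        foreach (preg_split("/[^[:alpha:]]/u", $s, -1, PREG_SPLIT_NO_EMPTY) as $item) {
            $result .= ucfirst($item);
        }
        return $result;
    },
];

$input = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
$results = [];
foreach ($candidates as $label => $fn) {
    $results[$label] = timeCandidate($fn, $input, 1000);
}
var_dump($results);
```

The absolute numbers depend on the machine and PHP version, so only the ratio between candidates measured in the same run is meaningful.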

Mustofa Rizwan · Answer 3 · 2017-04-08T09:00:50.587

1

You can try this regex:

(?:\b|\d+)([a-z])|[\d+ +!.@]

UPDATE (Run it here)

The idea above is to show how this should work in a regex. The following is a PHP implementation of it. You may compare it with yours, as it performs the whole operation in a single replace:

<?php

$re = '/(?:\b|\d+)([a-z])|[\d+ +!.@]/';
$str = 'Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ';

$result = preg_replace_callback($re, function ($matches) {
    // Group 1 exists only when a letter was captured; anything else is deleted.
    return isset($matches[1]) ? strtoupper($matches[1]) : '';
}, $str);

echo $result;

?>

Regex Demo
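To see what the two alternatives of the pattern actually pick up, the matches can be listed with preg_match_all. This is only an illustrative trace on a short sample ('th3 br0wn', chosen here, not taken from the benchmark). It also shows why the isset check in the callback is needed: matches from the second alternative carry no group 1 at all.

```php
<?php
$re = '/(?:\b|\d+)([a-z])|[\d+ +!.@]/';

// One array per match: index 0 is the full match, index 1 the captured letter (if any).
preg_match_all($re, 'th3 br0wn', $matches, PREG_SET_ORDER);

$trace = [];
foreach ($matches as $m) {
    // First alternative: a letter at a word boundary or right after digits.
    // Second alternative: a digit, '+', space, '!', '.' or '@' with no capture.
    $trace[] = isset($m[1])
        ? sprintf('"%s" -> "%s"', $m[0], strtoupper($m[1]))
        : sprintf('"%s" -> deleted', $m[0]);
}
echo implode("\n", $trace), "\n";
```

Letters that match neither alternative (such as the h in th3) are left untouched by the replace.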

edited Apr 08 '17 at 09:00

answered Apr 08 '17 at 05:56


OP is asking in relation to speed. Also isn't `\U` a [regex101.com](https://regex101.com) specific thing? – vallentin Apr 08 '17 at 06:05
Downvoting & flagging as it doesn't answer the question. The regex dialect used is not valid in PHP, and clearly wasn't tested in PHP. – Adam Cameron Apr 08 '17 at 06:24
@AdamCameron I have updated the answer.. don't you think it was premature to do flagging an answer which didn't give you a php implementation rather generic idea of the solution ? – Mustofa Rizwan Apr 08 '17 at 07:36
You get a bunch of `PHP Notice: Undefined offset: 1` errors as well as the correct output with that. You might try `return (isset($matches[1]) ? strtoupper($matches[1]) : '');` – Robin Mackenzie Apr 08 '17 at 08:04
@RobinMackenzie you are absolutely right .. updated ... thanks mate – Mustofa Rizwan Apr 08 '17 at 09:01
@rizwan: I flagged it as "not an answer, but could be with some revision". You seem to actually agree. Not sure what the problem is. Your update is now good, and I'll remove my down vote. – Adam Cameron Apr 08 '17 at 13:09
@Adam Cameron, relax mate, I aint that serious either :) .. but I never got a flag thus not quite clear about flagging ;) – Mustofa Rizwan Apr 08 '17 at 15:08

score 0 · Answer 4 · answered Apr 09 '17 at 00:54

Before trying to improve the performance of your code, you first need code that works. You are apparently trying to handle UTF-8 encoded strings (since you added the u modifier to your pattern), but for the string liberté égalité fraternité the ucwords step yields Liberté égalité Fraternité instead of Liberté Égalité Fraternité, because ucwords (like ucfirst) cannot deal with multibyte characters.

After trying different approaches (with preg_split and preg_replace_callback), it seems that this preg_match_all version is the fastest:

function FoxJourneyLikeACamelsHumpUPMA(string $string): string {
    preg_match_all('~\pL+~u', $string, $m);
    foreach ($m[0] as &$v) {
        $v = mb_strtoupper(mb_substr($v, 0, 1)) . mb_strtolower(mb_substr($v, 1));
    }
    return implode('', $m[0]);
}

Obviously, it's slower than your initial code, but the two can't really be compared, since yours doesn't work for multibyte input.
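The underlying issue can be seen on a single word. The sketch below assumes a UTF-8 source file and the mbstring extension. The byte-level result is what the default C locale or a recent PHP version produces; other single-byte locales on older versions may behave differently.

```php
<?php
$word = 'égalité';

// strlen counts bytes, mb_strlen counts characters: each é takes two bytes in UTF-8.
$bytes = strlen($word);
$chars = mb_strlen($word, 'UTF-8');

// ucfirst only looks at the first byte, which is not a complete character here,
// so the word comes back unchanged.
$byteLevel = ucfirst($word);

// Uppercasing the first character (not byte) gives the intended result.
$charLevel = mb_strtoupper(mb_substr($word, 0, 1, 'UTF-8'), 'UTF-8')
           . mb_substr($word, 1, null, 'UTF-8');

printf("%d bytes, %d characters\n", $bytes, $chars);
echo $byteLevel, "\n";
echo $charLevel, "\n";
```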
