When determining the best pattern for your project, you will need to consider the following pattern factors:
- Accuracy (Robustness) -- whether the pattern is correct in all cases and is reasonably future-proof
- Efficiency -- the pattern should be direct, deliberate, and avoid unnecessary labor
- Brevity -- the pattern should use appropriate techniques to avoid unnecessary character length
- Readability -- the pattern should be kept as simple as possible
The above factors also happen to be in the hierarchical order that I strive to obey. In other words, it doesn't make much sense to me to prioritize factors 2, 3, or 4 when factor 1 doesn't quite satisfy the requirements. Readability is at the bottom of the list for me because in most cases I can follow the syntax.
Capture groups and lookarounds often impact pattern efficiency. The truth is, unless you are executing this regex on thousands of input strings, there is no need to toil over efficiency. It is perhaps more important to focus on pattern readability, which can be associated with pattern brevity.
Some patterns below will require some additional handling/flagging by their preg_ function, but here are some pattern comparisons based on the OP's sample input:
preg_split() patterns:
- /^[^A-Z]+\K|[A-Z][^A-Z]+\K/ (21 steps)
- /(^[^A-Z]+|[A-Z][^A-Z]+)/ (26 steps)
- /[^A-Z]+\K(?=[A-Z])/ (43 steps)
- /(?=[A-Z])/ (50 steps)
- /(?=[A-Z]+)/ (50 steps)
- /([a-z]{1})[A-Z]{1}/ (53 steps)
- /([a-z0-9])([A-Z])/ (68 steps)
- /(?<=[a-z])(?=[A-Z])/x (94 steps) ...for the record, the x is useless.
- /(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (134 steps)
preg_match_all() patterns:
- /[A-Z]?[a-z]+/ (14 steps)
- /((?:^|[A-Z])[a-z]+)/ (35 steps)
I'll point out that there is a subtle difference between the output of preg_match_all() and preg_split(). preg_match_all() will output a 2-dimensional array. In other words, all of the fullstring matches will be in the [0] subarray; if there is a capture group used, those substrings will be in the [1] subarray. On the other hand, preg_split() only outputs a 1-dimensional array and therefore provides a less bloated and more direct path to the desired output.
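To make that structural difference concrete, here is a minimal sketch that runs two of the patterns listed above against the OP's sample string. The variable names ($input, $matches, $pieces) are mine and only for illustration.

```php
<?php
$input = 'oneTwoThreeFour';

// preg_match_all() fills $matches with one subarray per group:
// [0] holds the fullstring matches, [1] holds capture group 1.
preg_match_all('/((?:^|[A-Z])[a-z]+)/', $input, $matches);

// preg_split() returns the pieces directly as a flat array.
$pieces = preg_split('/(?=[A-Z])/', $input);

var_export($matches);
echo "\n---\n";
var_export($pieces);
```

With preg_match_all() you still have to reach into $matches[0] to get the words, whereas the return value of preg_split() is already the list you want.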
Some of the patterns are insufficient when dealing with camelCase strings that contain an ALLCAPS/acronym substring in them. If this is a fringe case that is possible within your project, it is logical to only consider patterns that handle these cases correctly. I will not be testing TitleCase input strings because that is creeping too far from the question.
New Extended Battery of Test Strings:
oneTwoThreeFour
hasConsecutiveCAPS
newNASAModule
USAIsGreatAgain
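As a quick illustration of why the simpler patterns fall short, here is a sketch that runs one naive split pattern and one acronym-aware split pattern (both taken from the earlier list) against one of the test strings above. The variable names are just placeholders that I picked for the demonstration.

```php
<?php
$input = 'newNASAModule';

// Naive: split before every capital letter, which shreds the acronym.
$naive = preg_split('/(?=[A-Z])/', $input);

// Acronym-aware: split between a lowercase letter and a capital, or between
// a capital and a capital that is followed by a lowercase letter.
$aware = preg_split('/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/', $input);

echo implode(' | ', $naive), "\n";
echo implode(' | ', $aware), "\n";
```

The first call breaks NASA into single letters; the second keeps it together as one element.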
Suitable preg_split() patterns:
- /[a-z]+\K|(?=[A-Z][a-z]+)/ (149 steps) *I had to use [a-z] for the demo to count properly
- /(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (547 steps)
Suitable preg_match_all() pattern:
- /[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|$)/ (75 steps)
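Here is a rough sketch that runs the preg_match_all() pattern above against the whole battery of test strings. The helper name splitCamelCase() is hypothetical, something I made up for this illustration; it is not a built-in function.

```php
<?php
// Hypothetical helper (not a built-in): returns the words found in a camelCase string.
function splitCamelCase(string $input): array
{
    // First branch: an optional capital followed by lowercase letters (a normal word).
    // Second branch: a run of capitals that stops before the next word or at the end.
    $pattern = '/[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|$)/';

    // preg_match_all() returns the match count (0 is falsy), so fall back to [].
    return preg_match_all($pattern, $input, $out) ? $out[0] : [];
}

$tests = ['oneTwoThreeFour', 'hasConsecutiveCAPS', 'newNASAModule', 'USAIsGreatAgain'];

foreach ($tests as $test) {
    echo $test, ' => ', implode(' | ', splitCamelCase($test)), "\n";
}
```

Wrapping the call in a small function like this hides the $out[0] lookup, which takes away some of the structural awkwardness of preg_match_all().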
Finally, here are my recommendations based on my pattern principles / factor hierarchy. I also recommend preg_split() over preg_match_all() (despite the preg_match_all() patterns having fewer steps) as a matter of directness to the desired output structure. (Of course, choose whatever you like.)
Code: (Demo)
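$noAcronyms = 'oneTwoThreeFour';
// \K resets the match start at the end of each word, so the split consumes no characters
var_export(preg_split('~^[^A-Z]+\K|[A-Z][^A-Z]+\K~', $noAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
// preg_match_all() returns the match count; the fullstring matches are in $out[0]
var_export(preg_match_all('~[A-Z]?[^A-Z]+~', $noAcronyms, $out) ? $out[0] : []);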
Code: (Demo)
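$withAcronyms = 'newNASAModule';
// split after each run of non-capitals, or before a capital that starts a new word
var_export(preg_split('~[^A-Z]+\K|(?=[A-Z][^A-Z]+)~', $withAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
// match a normal word, or a run of capitals that stops before the next word or at the end
var_export(preg_match_all('~[A-Z]?[^A-Z]+|[A-Z]+(?=[A-Z][^A-Z]|$)~', $withAcronyms, $out) ? $out[0] : []);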