5

I want to split a large string by a series of words.

E.g.

$splitby = array('these','are','the','words','to','split','by');
$text = 'This is the string which needs to be split by the above words.';

Then the results would be:

$text[0]='This is';
$text[1]='string which needs';
$text[2]='be';
$text[3]='above';
$text[4]='.';

How can I do this? Is preg_split the best way, or is there a more efficient method? I'd like it to be as fast as possible, as I'll be splitting hundreds of MB of files.

hakre
Alasdair
  • Afternote: racar's answer is the fastest, if array_flip is performed on $splitby and then isset() is used instead of in_array(). preg_split does not work because there are hundreds of words in $splitby. – Alasdair Nov 10 '11 at 07:05

4 Answers

7

This should be reasonably efficient. However you may want to test with some files and report back on the performance.

$splitby = array('these','are','the','words','to','split','by');
$text = 'This is the string which needs to be split by the above words.';
// implode() takes the separator first; the array-first order was removed in PHP 8.
$pattern = '/\s?'.implode('\s?|\s?', $splitby).'\s?/';
$result = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
mellamokb
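One caveat with a bare alternation is that a short word such as "the" also matches inside longer words like "other". Below is a minimal sketch of a stricter pattern, assuming whole-word matching is what's wanted. The helper name `split_by_words` is hypothetical and not part of the answer above.

```php
<?php
// Hypothetical helper: split $text on whole words taken from $splitby.
function split_by_words(string $text, array $splitby): array
{
    // Escape any regex metacharacters that might appear in the words.
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $splitby);
    // \b keeps "the" from matching inside "other"; \s* swallows adjacent whitespace.
    $pattern = '/\s*\b(?:' . implode('|', $quoted) . ')\b\s*/';
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

$splitby = array('these', 'are', 'the', 'words', 'to', 'split', 'by');
$text = 'This is the string which needs to be split by the above words.';
print_r(split_by_words($text, $splitby));
```

The match is still case-sensitive, so "This" is not treated as a split word; adding the `i` modifier would change that.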
5

preg_split can be used as:

$pieces = preg_split('/'.implode('\s*|\s*',$splitby).'/',$text,-1,PREG_SPLIT_NO_EMPTY);


codaddict
4

I don't think a PCRE regex is necessary if all you really need is to split on whole words.

You could do something like this and benchmark it to see whether it's faster or better:

$splitby = array('these','are','the','words','to','split','by');
$text = 'This is the string which needs to be split by the above words.';

$split = explode(' ', $text);
$result = array();
$temp = array();

foreach ($split as $s) {

    if (in_array($s, $splitby)) {
        if (sizeof($temp) > 0) {
           $result[] = implode(' ', $temp);
           $temp = array();
        }            
    } else {
        $temp[] = $s;
    }
}

if (sizeof($temp) > 0) {
    $result[] = implode(' ', $temp);
}

var_dump($result);

/* output

array(4) {
  [0]=>
  string(7) "This is"
  [1]=>
  string(18) "string which needs"
  [2]=>
  string(2) "be"
  [3]=>
  string(12) "above words."
}

*/

The only difference from your expected output is the last element: "words." (with the trailing period) is not equal to "words", so it is not treated as a split word.

malletjo
  • Thank you for your help. Though in_array() is very slow for large arrays, preg_split is much faster. – Alasdair Nov 10 '11 at 04:00
  • Maybe you're right, but you may get "Compilation failed: regular expression is too large at offset ******" if you use preg_split. I just tried it with an array of 5490 words and it failed. – malletjo Nov 10 '11 at 04:44
  • Well, it turned out that preg_split was taking too long for my liking. See my solution below. Your solution is good, but the in_array() function has problems in PHP. A faster way to check for the existence of a value in an array is to array_flip the array and then check for the existence of the key with isset(), which is about 1000x faster than using in_array(). – Alasdair Nov 10 '11 at 04:52
  • array_flip + isset seems like a good idea. But the difference is "only" 30ms for an array of 200k elements. – malletjo Nov 10 '11 at 05:12
  • In my experience the difference is seconds vs. hours, literally. I think there's a serious problem with in_array(). Anyway, neither the preg_split nor my method I posted then deleted has achieved what I want. I'm now testing your method modified to use isset(). – Alasdair Nov 10 '11 at 05:22
  • Genius! With modification to use array_flip() & isset() this is both fast and efficient. I'm using this now. Thank you! – Alasdair Nov 10 '11 at 05:43
  • You can still optimize my code by replacing the sizeof() calls with a variable, and maybe with some other micro-optimizations. – malletjo Nov 10 '11 at 22:24
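
The array_flip() + isset() modification discussed in these comments might look roughly like the sketch below. The asker's actual code was not posted, so this is an assumption about its shape, and the function name `split_by_lookup` is hypothetical.

```php
<?php
// Hypothetical helper: the same loop as the answer above, with a hash lookup.
function split_by_lookup(string $text, array $splitby): array
{
    // Flip once so each word becomes a key. isset() on a key is a hash lookup,
    // whereas in_array() scans the array on every call.
    $lookup = array_flip($splitby);

    $result = array();
    $temp = array();

    foreach (explode(' ', $text) as $word) {
        if (isset($lookup[$word])) {
            // Hit a split word: flush the words collected so far.
            if ($temp) {
                $result[] = implode(' ', $temp);
                $temp = array();
            }
        } else {
            $temp[] = $word;
        }
    }

    // Flush whatever is left after the last split word.
    if ($temp) {
        $result[] = implode(' ', $temp);
    }

    return $result;
}

$splitby = array('these', 'are', 'the', 'words', 'to', 'split', 'by');
$text = 'This is the string which needs to be split by the above words.';
var_dump(split_by_lookup($text, $splitby));
```

As with the original loop, this only splits on single spaces, so "words." (with the period) is still treated as a different token from "words".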
-1

Since the words in your $splitby array are not regular expressions, maybe you can use

str_split

Yada
  • `str_split()` cannot separate a string by a string. It merely splits a string into an array of chunks whose length is given by the last argument (which defaults to 1). – Bailey Parker Nov 10 '11 at 03:20
  • This answer doesn't make sense, considering he wants to split the string by the specific words, not split it into word-sized chunks. – Joe C. Nov 10 '11 at 03:24
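
To illustrate the point made in these comments, here is a quick sketch of what str_split() returns for a short string. The sample string is made up for the example.

```php
<?php
// str_split() cuts by length, not by delimiter.
$sample = 'split by the';

// Chunks of 5 characters; the last chunk holds the remainder.
$chunks = str_split($sample, 5);
var_dump($chunks);

// The default chunk length is 1, i.e. one character per element.
$chars = str_split('by');
var_dump($chars);
```

The chunk boundaries fall wherever the character count dictates, so they cut through words regardless of the contents of $splitby.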