I am looking for assistance in making this code more accurate. For any given text ($my_block_of_text), the script below breaks the content up into sentences wherever full stops, exclamation marks and similar end-of-sentence punctuation occur.

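   // Split on end-of-sentence punctuation, keeping each delimiter as its own element.
   $parts = preg_split('/([.?!:\]])/', $my_block_of_text, -1, PREG_SPLIT_DELIM_CAPTURE);
   $sentences = array();
   // Re-join each chunk of text with the delimiter that follows it.
   for ($i=0, $n=count($parts)-1; $i<$n; $i+=2) {
    $sentences[] = $parts[$i].$parts[$i+1];
   }
   // Keep any trailing text that has no closing punctuation.
   if ($parts[$n] != '') {
    $sentences[] = $parts[$n];
   }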

The issue with this code, however, is that the regular expression passed to preg_split doesn't account for abbreviations such as Mr., Mrs., Miss. and Ms., so the text is also split after each of them. How can an exclusion be added to the regular expression to avoid splitting at these?

Thanks.

feelsickened
  • There is an endless number of abbreviations where you don't want to break up the content. There is no simple solution. What we humans can do with ease can be very difficult to capture in an algorithm. – KIKO Software Mar 25 '15 at 14:50
  • Did you take a look at http://stackoverflow.com/questions/16377437/split-a-text-into-sentences? It doesn't solve your Mr. and Mrs. problem, but it's as close as it gets. There is also the extended solution: http://stackoverflow.com/questions/5032210/php-sentence-boundaries-detection – Marc Mar 25 '15 at 14:51
  • What I personally do in this case is replace the problem substrings with a substitute. For example, you can replace Mr. with =MR=. Then your preg_split will work. When you are done, you reverse the replacement, replacing =MR= with Mr. Because you are replacing an array of values with an array of values, the entire process becomes replace, split, replace. I feel it is less complicated than a monstrous regular expression. – kainaw Mar 25 '15 at 16:04
  • @Marc - thanks for the helpful & constructive point in the right direction. – feelsickened Mar 27 '15 at 17:52
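
The replace, split, replace idea from kainaw's comment could look roughly like the sketch below. It is only an illustration, not code from any of the commenters: the function name splitSentencesWithPlaceholders, the =ABBRn= placeholder format and the abbreviation list are all made up for this example, and the split itself reuses the pattern and loop from the question.

```php
<?php
// Hypothetical helper illustrating replace, split, replace.
function splitSentencesWithPlaceholders(string $text, array $abbreviations): array
{
    $abbreviations = array_values($abbreviations);

    // Build one placeholder per abbreviation, e.g. "Mr." => "=ABBR0=".
    // The placeholders must not contain any of the split characters.
    $placeholders = array();
    foreach ($abbreviations as $i => $abbr) {
        $placeholders[] = '=ABBR' . $i . '=';
    }

    // 1. Replace: hide the abbreviations so their full stops are not seen.
    $masked = str_replace($abbreviations, $placeholders, $text);

    // 2. Split: same pattern and loop as in the question.
    $parts = preg_split('/([.?!:\]])/', $masked, -1, PREG_SPLIT_DELIM_CAPTURE);
    $sentences = array();
    for ($i = 0, $n = count($parts) - 1; $i < $n; $i += 2) {
        $sentences[] = $parts[$i] . $parts[$i + 1];
    }
    if ($parts[$n] != '') {
        $sentences[] = $parts[$n];
    }

    // 3. Replace back: restore the abbreviations and trim stray whitespace.
    foreach ($sentences as $k => $sentence) {
        $sentences[$k] = trim(str_replace($placeholders, $abbreviations, $sentence));
    }
    return $sentences;
}

$sentences = splitSentencesWithPlaceholders(
    'Mr. Smith met Mrs. Jones. They talked! Did Ms. Lee join?',
    array('Mr.', 'Mrs.', 'Ms.', 'Miss.')
);
print_r($sentences);
```

Note that str_replace is case-sensitive and matches anywhere in the text, so an abbreviation that happens to end a longer word would be masked as well; whether that matters depends on the input.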

1 Answer

The best answer I have found for creating arrays of coherent sentences is the regex solution found in the link suggested by @Marc in the comments above.

The best thing about this regex is that you can add to it. For example, I've added abbreviations for months such as SEPT., which are typically followed by a full stop.

https://stackoverflow.com/a/7438782/3662086
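
To give a feel for what "adding to it" can mean, here is a much smaller sketch of the same general idea. It is not the regex from the linked answer: it simply assumes a list of abbreviations (the $abbreviations array below is hypothetical) and turns each one into a negative lookbehind, so that whitespace following one of those abbreviations is not treated as a sentence boundary. Unlike the pattern in the question, it splits on the whitespace after ., ? or ! and ignores : and ].

```php
<?php
// Hypothetical list; extend it with whatever abbreviations your text uses.
$abbreviations = array('Mr', 'Mrs', 'Ms', 'Miss', 'Sept');

// One fixed-length negative lookbehind per abbreviation, e.g. (?<!\bMr\.)
$lookbehinds = '';
foreach ($abbreviations as $abbr) {
    $lookbehinds .= '(?<!\b' . preg_quote($abbr, '/') . '\.)';
}

// Split on whitespace that follows ., ? or !, unless the full stop
// belongs to one of the listed abbreviations.
$pattern = '/' . $lookbehinds . '(?<=[.?!])\s+/';

$text = 'Mr. Smith arrived on 3 Sept. 2014. Mrs. Jones was late! Was Ms. Lee there?';
$sentences = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($sentences);
```

The match is case-sensitive as written, so SEPT. would need its own entry in the list (or the i modifier on the pattern). A sentence that genuinely ends with one of the listed abbreviations will not be split, which is the usual trade-off with this kind of exclusion list.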

feelsickened