I'm using preg_split with a regex to split text into an array of sentences, and the split itself works. However, part of the pattern I'm telling preg_split to find is part of the text itself, so that part of the text is removed as well. Is there a way to re-insert the matched text into the array? For instance, if I tell preg_split to search for a period followed by a capital letter, it removes the capital letter from the array, which I don't want.

This is the code:

$line = preg_split("@[\.\?\!\:][\W]+[A-Z]@", $text); // $text (assumed name) holds the sample string below

Sample String:

This is sentence one. This is sentence two? This is sentence three! This is sentence four: This is sentence five. This is sentence six, this is also U.S. sentence six. Secretary of Defense Chuck Hagel echoed Kerry's remark, saying "very high" when asked by Virginia Democratic Rep. Gerry Connolly about the likelihood of another Syrian chemical attack absent U.S. action.

Is there a way around this?

Thanks

user1926567
  • Please add the code you're using to your question – MDEV Sep 04 '13 at 22:18
  • Please edit the code you are currently using to split the sentence into your question. – Bad Wolf Sep 04 '13 at 22:18
  • I think you are referring to a "positive lookahead" – A.O. Sep 04 '13 at 22:19
  • Do you have an example string? – Explosion Pills Sep 04 '13 at 22:21
  • Take a look at [my answer to a similar question](http://stackoverflow.com/a/5844564/433790). It should work quite nicely but you'll need to add: `U.S.` and `Rep.` to the list of non-end-of-sentence special cases (like `Dr.`, `Mr.`, `Mrs.` etc), and add `:` to the list of sentence terminators (`[.!?]`). – ridgerunner Sep 05 '13 at 02:29

1 Answer

Using a positive lookahead, this should work:

$line = preg_split("@[\.\?\!\:][\W]+(?=[A-Z])@", $text); // $text (assumed name) is the string to split

Anything between the "(?=" and ")" has to match at that position but is not consumed, so it is not removed from the result. Add appropriate repetition operators after the last parenthesis.

Searching for "regex look-arounds, lookaheads, lookbehinds, assertions" will yield a plethora of information on how to use these features correctly :-)
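As a rough illustration of the difference, here is a minimal sketch that runs both patterns on a shortened version of the sample text. The variable names `$text`, `$consumed` and `$kept` are made up for this example and are not from the question.

```php
<?php
// Shortened sample text; the variable names here are illustrative only.
$text = 'This is sentence one. This is sentence two? This is sentence three!';

// Without a lookahead, the capital letter is part of the delimiter and gets removed.
$consumed = preg_split('@[.?!:]\W+[A-Z]@', $text);

// With a lookahead, the capital letter is only checked, so it stays in the next piece.
$kept = preg_split('@[.?!:]\W+(?=[A-Z])@', $text);

print_r($consumed); // second and third pieces start with "his is ..."
print_r($kept);     // second and third pieces start with "This is ..."
```

Note that the terminator itself (the period or question mark) is still consumed in both cases. If it should stay as well, one option is to move it into a lookbehind, for example `(?<=[.?!:])\W+(?=[A-Z])`.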

A.O.
  • Hi, this worked. However, when I add this sentence >> Secretary of Defense Chuck Hagel echoed Kerry's remark, saying "very high" when asked by Virginia Democratic Rep. Gerry Connolly about the likelihood of another Syrian chemical attack absent U.S. action. << The sentence is split up between Rep. and Gerry. Is there a way around this? – user1926567 Sep 04 '13 at 22:50
  • Sorry, I didn't read closely at first. This is a fairly special case; you could always do a negative lookbehind for "Rep" if you know that this abbreviation will be used often: $line = preg_split("@(?<!Rep)[\.\?\!\:][\W]+(?=[A-Z])@", $text); – A.O. Sep 04 '13 at 22:56
  • Hi, thanks for your reply. The thing is, I don't know in advance what the word will be. Sometimes it will be "Rep.", sometimes it will be something else. The only thing I do know is that the word will begin with a capital letter, I believe (if this helps at all). – user1926567 Sep 04 '13 at 22:59
  • Yeah, I understand. It can be very difficult and frustrating trying to make your regex work 100% correctly, but that will never happen lol, especially when dealing with natural language, where there are infinite possibilities. All you can do is keep tweaking it when you notice special cases like these. For instance, if you know that the content will always be about politics, you can account for common political terms that are often abbreviated, instead of accounting for every possible abbreviation that ever existed. "The word begins with a capital letter" is too broad... – A.O. Sep 04 '13 at 23:03
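
The "keep tweaking it" approach from the comments could be sketched roughly as below, assuming a hand-maintained list of abbreviations. The `$abbreviations` list, the other variable names and the sample sentence are hypothetical. A list like this will never cover every case in natural language.

```php
<?php
// Hypothetical, hand-maintained list of abbreviations that should not end a sentence.
$abbreviations = ['Rep', 'Mr', 'Mrs', 'Dr'];

// Build one negative lookbehind from the list: (?<!Rep|Mr|Mrs|Dr)
$lookbehind = '(?<!' . implode('|', array_map('preg_quote', $abbreviations)) . ')';

// Terminator, then non-word characters, then a capital letter that is checked but not consumed.
$pattern = '@' . $lookbehind . '[.?!:]\W+(?=[A-Z])@';

// Made-up sample sentence for illustration.
$text = 'He was asked by Rep. Gerry Connolly about it. Nobody answered.';
$sentences = preg_split($pattern, $text);

print_r($sentences); // "Rep. Gerry" stays together; the text splits after "about it"
```

Adding a newly noticed abbreviation then only means appending it to the list, not rewriting the pattern by hand.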