
I'm struggling with cutting the very first sentence from a string. It wouldn't be such a problem if there were no abbreviations ending with a dot.

So my example is:

  • string = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.'

And the result should be:

  • result = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow.'

Normally I would do it with:

re.findall(r'^(\s*.*?\s*)(?:\.|$)', event)

but I would like to skip some pre-defined words, like the above-mentioned etc.

I came up with a couple of expressions, but none of them worked.
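One regex-based workaround (a sketch, not code from the question): give each pre-defined abbreviation its own fixed-width negative lookbehind, so its trailing dot is never accepted as the sentence-ending dot. The abbreviation list here (`etc.`, `intern.`) is illustrative; extend the alternatives as needed.

```python
import re

text = ('I like cheese, cars, etc. but my the most favorite website is '
        'stackoverflow. My new horse is called Randy.')

# Lazily consume characters, then accept a dot only if it is NOT preceded
# by one of the listed abbreviations. Python's re module requires each
# lookbehind to be fixed-width, hence one lookbehind per abbreviation.
first = re.match(r'(.+?(?<!\betc)(?<!\bintern)\.)(?:\s|$)', text)
print(first.group(1))
# I like cheese, cars, etc. but my the most favorite website is stackoverflow.
```

This keeps the dot after "etc." inside the first sentence, but like any hand-maintained abbreviation list it breaks as soon as an unlisted abbreviation appears.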

skornos
    http://stackoverflow.com/a/1732454/203705 -- doubly true for natural language. What are you really trying to do? What's the bigger problem you're trying to solve? – Ben Burns Apr 03 '12 at 15:44
  • I have got a string with some event reference and I am creating an acronym for this event. As I have found out, the basic information is in the first sentence with the name of the event, but sometimes abbreviations are used, like intern. = international and so on, and this can really be a pain in the ass. – skornos Apr 03 '12 at 15:49
  • Last time I posted the Tony the Pony link, [tchrist](http://stackoverflow.com/users/471272/tchrist) came out of the woodwork and argued me into submission. Anyway, are there any constraints, or are we talking about arbitrary English sentences? It would be helpful if you could say that your list of pre-defined words (including `etc.`) never show up at the end of the sentence, or that a sentence always begins with a capital letter and the word after `etc.` never does. – cha0site Apr 03 '12 at 15:49

2 Answers


You could try NLTK's Punkt sentence tokenizer, which does this kind of thing using a real algorithm to figure out what the abbreviations are instead of your ad-hoc collection of abbreviations.

NLTK includes a pre-trained one for English; load it with:

nltk.data.load('tokenizers/punkt/english.pickle')

From the source code:

>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print '\n-----\n'.join(sent_detector.tokenize(text.strip()))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences 
can start with non-capitalized words.
-----
i is a good variable
name.
Danica
  • +1, this thing's great. English is tough and not well-suited for a regex. – Chris Eberle Apr 03 '12 at 15:51
  • Thank you, this seems like the ideal solution because I am actually using NLTK, but just for parsing the words of the sentence. But even though I have the library, it seems to raise a LookupError, so I guess I am missing some part of it. – skornos Apr 03 '12 at 15:59
  • @skornos You're probably missing the appropriate data file, as in [this question](http://stackoverflow.com/questions/4867197/failed-loading-english-pickle-with-nltk-data-load). – Danica Apr 03 '12 at 16:14
  • Damn, so it looks like words like Int. or Conf. aren't among the defined abbreviations in the library :( so it divides the sentence in a bad way – skornos Apr 03 '12 at 19:31
  • @skornos Hmm. You can try training on a different corpus that includes those abbreviations (I don't know what that one is trained on); I bet you could also add abbreviations manually, though it's not documented on how to do so. You could investigate [the source](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt-pysrc.html) to figure out how to do that, or do some googling. – Danica Apr 03 '12 at 20:28
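Following up on that last comment, here is a minimal sketch of registering extra abbreviations by hand. It relies on the tokenizer's internal `_params.abbrev_types` set, which is undocumented, so treat the attribute name and the lowercase-without-dot storage convention as assumptions that may change between NLTK versions. An untrained `PunktSentenceTokenizer` is used here so the example does not depend on the `english.pickle` data file.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# An untrained tokenizer falls back on Punkt's default heuristics.
tokenizer = PunktSentenceTokenizer()

# Assumption: abbreviation types are stored lowercase, without the
# trailing dot (based on reading the Punkt source).
tokenizer._params.abbrev_types.update(['int', 'conf'])

text = 'The Int. Conf. on Acoustics was great. It ran for two days.'
for sentence in tokenizer.tokenize(text):
    print(sentence)
```

With `int` and `conf` registered, the dots after "Int." and "Conf." should no longer be treated as sentence boundaries, so the text splits into two sentences instead of four. Training on a corpus that contains these abbreviations, as suggested above, is the more robust option.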

How about looking for the first capital letter after a sentence-ending character? It's not foolproof, of course.

import re

# Lazily match up to the first ., ? or ! that is followed by a capital letter.
r = re.compile(r"^(.+?[.?!])\s*[A-Z]")
print(r.match('I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.').group(1))

outputs

I like cheese, cars, etc. but my the most favorite website is stackoverflow.
AKX