import nltk
text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital
computer or the gears of a cycle transmission as he does at the top of a mountain
or in the petals of a flower. To think otherwise is to demean the Buddha...which is
to demean oneself."""
sentence_re = r'''(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'''
toks = nltk.regexp_tokenize(text, sentence_re)
Running the code above raises the following error:
File "C:\Users\AppData\Local\Continuum\Anaconda2\envs\Python35\lib\sre_parse.py", line 638, in _parse
source.tell() - here + len(this))
error: nothing to repeat
I understand there was previously a bug, but I am using the latest NLTK with Python 3.5, where I am led to believe I should not be hitting it. Does anyone have any idea what is going on?
I am running this in Spyder 3 from a Python 3.5 virtualenv.
The regex is trying to obtain (in order):
- abbreviations
- (optional) hyphenated words
- currency and percentages
- ellipsis and ad-hoc tokens, e.g. ? [ ( : and so on
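
For reference, here is a minimal sketch of a pattern that matches the list above. It rests on assumptions that are not confirmed by the code as posted: that `$` and `.` were meant as literal characters (and so need a backslash), and that the hyphen in the last character class was meant literally rather than as the range `:-_`. Python's `re` module reports "nothing to repeat" when a quantifier such as `?` is applied directly to an anchor like an unescaped `$`, which is consistent with the traceback. The sketch uses only the standard library's `re.findall` so that it runs without NLTK; the sample sentence is made up for illustration.

```python
import re

# Assumption: '$' and '.' are intended as literals, so they are escaped here.
# Assumption: '-' in the final character class is a literal, so it is placed last.
pattern = r'''(?x)              # verbose mode: whitespace and comments are ignored
    (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*                # words, with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?          # currency and percentages
  | \.\.\.                      # ellipsis
  | [\]\[.,;"'?():_`-]          # ad-hoc single-character tokens
'''

# Hypothetical sample sentence, not taken from the question.
sample = "That U.S.A. poster-print costs $12.40..."
tokens = re.findall(pattern, sample)
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

Because every group is non-capturing, the same `pattern` string should also be usable as the second argument of `nltk.regexp_tokenize(text, pattern)`, though that call is not exercised in this sketch. Note that alternatives are tried in order, so a bare number such as `82%` is consumed by the word branch as `82` before the currency branch is reached.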