import nltk

text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital
computer or the gears of a cycle transmission as he does at the top of a mountain
or in the petals of a flower. To think otherwise is to demean the Buddha...which is
to demean oneself."""


sentence_re = r'''(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'''

toks = nltk.regexp_tokenize(text, sentence_re)

Running this code, I get:

  File "C:\Users\AppData\Local\Continuum\Anaconda2\envs\Python35\lib\sre_parse.py", line 638, in _parse
    source.tell() - here + len(this))

error: nothing to repeat

I understand there was previously a bug, but I am using the latest NLTK and Python 3.5, where I am led to believe I should not be hitting it. Does anyone have any idea what is going on?

Run within Spyder3 from a Python 3.5 virtualenv

The regex is trying to obtain (in order):

  • abbreviations
  • (optional) hyphenated words
  • currency and percentages
  • ellipses and ad-hoc tokens, e.g. `?`, `[`, `(`, `:`, etc.
brucezepplin
  • Please post your desired output. – Ajax1234 Oct 18 '17 at 14:07
  • You cannot quantify the end-of-string anchor `$` as you did with `$?`. Without the exact requirements, we can't help you improve or fix the pattern. There are other obvious errors: `:-_\`]` must be written as `:_\`-]`, and dots that match literal dots must be escaped. See https://ideone.com/fzQCZD – Wiktor Stribiżew Oct 18 '17 at 14:08
  • I'm not sure what you're trying to do (grab words?), but `.` is a special character in regex (it matches any character, except the newline character unless the `s` flag is used, and is literal only inside a set), so you need to escape the `.` as `\.`. If you're simply trying to get words, use `[\w-]+` – ctwheels Oct 18 '17 at 14:11
  • Hi - have added what I am trying to achieve in last line of the post. – brucezepplin Oct 18 '17 at 14:19
  • @brucezepplin what do you mean by `in order`? Also, what defines an abbreviation? Are `.` permitted? What character defines a hyphenated word (I assume `-`, but see this Wikipedia article for more options: [The hyphen in other languages](https://en.wikipedia.org/wiki/Hyphen#The_hyphen_in_other_languages)), what currency/currencies (there are so many currency symbols). If you're looking for multiple you may want to look at this post [Python regex matching Unicode properties](https://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties) since python doesn't allow... – ctwheels Oct 18 '17 at 14:22
  • sorry - in the way the regex is written ie. the first part `(?:(?:[A-Z])(?:.[A-Z])+.?)` should be abbreviations – brucezepplin Oct 18 '17 at 14:23
  • ... Unicode groups by default (such as `\p{Sc}`, which is the easiest way to grab any unicode currency symbol). Also, what do you mean by ellipsis and ad-hoc tokens **etc.**? Do you want to grab all visible symbols? Again, unicode would be super helpful here with `\p{S}\p{P}` (symbols, punctuation) – ctwheels Oct 18 '17 at 14:29
  • So if I understand properly you want specificity instead of the order in terms of each section of the regex? So abbreviations are *considered* more important than hyphenated words, which, in turn, are *considered* more important than currency and percentages, etc. – ctwheels Oct 18 '17 at 14:30
  • @ctwheels thanks for the suggestion. already this has been helpful because I think the error is down to malformed regex. I will give the unicode a go. – brucezepplin Oct 18 '17 at 14:30
  • @ctwheels, I don't want to embed importance; these are merely the tokens I would like to extract, i.e. one is not more important than any other. – brucezepplin Oct 18 '17 at 14:31
  • You have to embed importance in some respect since regex will match in order of the options so, for example `.|[a-z]` is malformed regex because the first option `.` will **always** match any character (`a-z` included), therefore, in this case, `[a-z]` will never be matched. A better solution would be to use `[a-z]|.` since specificity matters (obviously this is a terrible example, but it does show you the difference in order of elements) – ctwheels Oct 18 '17 at 14:34
  • @ctwheels that's not "malformed", since it's legal; it's just useless. Malformed regexes trigger an error. – alexis Oct 18 '17 at 17:28
  • @brucezepplin Did you see https://ideone.com/fzQCZD? – Wiktor Stribiżew Oct 19 '17 at 07:10
  • @brucezepplin the error in your regex is caused by the `?` in `(?:$?\d+(?:.\d+)?%?)`: when I try it in an online tool, I get "the preceding token is not quantifiable". The solution is to escape it with `\`, so that sub-expression should be `(?:$\?\d+(?:.\d+)?%?)` instead of `(?:$?\d+(?:.\d+)?%?)`. That removes the error. I am not sure about the output, as I don't know the desired output from the question. – utengr Oct 19 '17 at 09:55
  • Your script works fine if you make that change, as I tested it. However, I am not sure about the desired result. It looks like you are trying to match `?` there, which might not be what you want to do. As it is a quantifier, you need to escape it to match it. – utengr Oct 19 '17 at 09:56
  • https://regex101.com/r/jH2dN5/1 I posted an example for you here. Just play with it to see whether your regex works or not. – utengr Oct 19 '17 at 10:13
  • @engr_s thanks very much will check this out – brucezepplin Oct 19 '17 at 10:25

1 Answer


The error you get is related to the fact that you quantified a `$` end-of-string anchor. An unescaped `$` is a zero-width assertion that matches at the end of the string, so there is nothing for `?` to repeat. To match a literal `$`, you need to escape it as `\$`.
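To see the difference in isolation, here is a minimal sketch using only the standard `re` module. The sample string is made up for illustration, and the dot is escaped in the fixed version as well:

```python
import re

# The question's currency sub-pattern, with `$` left unescaped.
broken = r'(?:$?\d+(?:.\d+)?%?)'
# Escaping `$` turns it into a literal dollar sign; the dot is escaped too.
fixed = r'(?:\$?\d+(?:\.\d+)?%?)'

try:
    re.compile(broken)
    compile_error = None
except re.error as exc:
    compile_error = str(exc)

print(compile_error)  # message begins with "nothing to repeat"

# Hypothetical sample text, not taken from the question.
amounts = re.findall(fixed, "costs $4.50 or 12% more")
print(amounts)
```

The quantifier `?` directly after the anchor is what triggers the compile error; once `$` is escaped, the same sub-pattern compiles and picks up both the price and the percentage.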

The `.` characters in your expression also need to be escaped to match literal dots.
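A small sketch of what the unescaped dots do to the abbreviation sub-pattern (the test strings are invented for illustration):

```python
import re

loose = re.compile(r'[A-Z](?:.[A-Z])+.?')     # each dot matches any character
strict = re.compile(r'[A-Z](?:\.[A-Z])+\.?')  # each dot matches only "."

# "UxSy" is not an abbreviation, yet the loose pattern accepts it.
loose_hit = loose.fullmatch("UxSy") is not None
strict_hit = strict.fullmatch("UxSy") is not None
strict_abbrev = strict.fullmatch("U.S.") is not None

print(loose_hit, strict_hit, strict_abbrev)
```

With the dots unescaped, any character can stand between the capital letters, so the pattern accepts strings that are clearly not abbreviations.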

There is also a problem with `-` forming a range in the character class `` [][.,;"'?():-_`] ``: `:-_` matches every character from `:` to `_`. To make sure `-` matches a literal hyphen, put it at the end, just before the closing `]`.
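The effect of the accidental range can be checked with a reduced character class. This is a sketch with a made-up input string:

```python
import re

text = "A-b:_"

# `:-_` is read as a range from ":" to "_", which covers the uppercase
# ASCII letters but not the hyphen itself.
as_range = re.findall(r'[:-_]', text)

# With `-` placed last, the class holds three literals: ":", "_" and "-".
as_literal = re.findall(r'[:_-]', text)

print(as_range)
print(as_literal)
```

In the first case the hyphen is skipped and the capital `A` is matched instead, which is the opposite of what the punctuation class is meant to do.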

Besides, it seems you want to match words that do not contain underscores (as you placed the `_` in the last character class). Thus, I suggest subtracting `_` from the `\w` pattern and replacing `\w+(?:-\w+)*` with `[^\W_]+(?:-[^\W_]+)*`.
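A quick comparison of the two word patterns, again on an invented sample string:

```python
import re

text = "snake_case well-known"

# `\w` includes the underscore, so "snake_case" stays in one piece.
with_w = re.findall(r'\w+(?:-\w+)*', text)

# `[^\W_]` is "any word character except underscore", so the
# underscore now separates two tokens.
without_underscore = re.findall(r'[^\W_]+(?:-[^\W_]+)*', text)

print(with_w)
print(without_underscore)
```

Hyphenated words are kept together by both patterns; only the handling of `_` changes.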

Here is a pattern with my suggestions implemented:

sentence_re = r'''\$?\d+(?:\.\d+)?%?|[A-Z](?:\.[A-Z])+\.?|[^\W_]+(?:-[^\W_]+)*|(?:\.{3}|)[][.,;"'?():_`-]'''

See the regex demo
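As a rough check of the pattern above, the sketch below runs it through `re.findall`. Using `re.findall` as a stand-in for `nltk.regexp_tokenize` is an assumption made here to keep the example dependency-free (the pattern has no capturing groups, so `findall` returns whole matches), and both sample strings are hypothetical:

```python
import re

sentence_re = r'''\$?\d+(?:\.\d+)?%?|[A-Z](?:\.[A-Z])+\.?|[^\W_]+(?:-[^\W_]+)*|(?:\.{3}|)[][.,;"'?():_`-]'''

# Hypothetical sample sentence, not taken from the question.
sample = "U.S.A. spent $4.50 (12%) on well-known items."
tokens = re.findall(sentence_re, sample)
print(tokens)

# An ellipsis followed directly by a word.
ellipsis_tokens = re.findall(sentence_re, "wait...what")
print(ellipsis_tokens)
```

One thing worth checking against your requirements: in this pattern the optional `\.{3}` must be followed by a character from the punctuation class, so a bare `...` followed by a word comes out as three separate `.` tokens rather than one ellipsis token.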

Wiktor Stribiżew