nltk / re : nothing to repeat while trying to tokenize with regex

Question

import nltk

text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital
computer or the gears of a cycle transmission as he does at the top of a mountain
or in the petals of a flower. To think otherwise is to demean the Buddha...which is
to demean oneself."""


sentence_re = r'''(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'''

toks = nltk.regexp_tokenize(text, sentence_re)

but I get:

  File "C:\Users\AppData\Local\Continuum\Anaconda2\envs\Python35\lib\sre_parse.py", line 638, in _parse
    source.tell() - here + len(this))

error: nothing to repeat

I understand there was previously a bug, but I am using the latest NLTK and Python 3.5, where I am led to believe I should not be experiencing it. Does anyone have any idea what is going on?

Run within Spyder3 from a Python 3.5 virtualenv

The regex is trying to obtain (in order):

abbreviations
(optional) hyphenated words
currency and percentages
ellipsis and ad-hoc tokens, e.g. ? [ ( : etc.

You cannot quantify the end of string `$` as you did - `$?`. Without the exact requirements, we can't help you improve/fix the pattern. There are other obvious errors: `:-_\`]` must be written as `:_\`-]`, and dots that match literal dots must be escaped. See https://ideone.com/fzQCZD — Wiktor Stribiżew, Oct 18 '17 at 14:08
I'm not sure what you're trying to do (grab words?), but `.` is a special character in regex: outside a character set it matches any character (except the newline character, unless the `s` flag is used). Therefore, you need to escape the `.` as `\.` to match a literal dot. If you're simply trying to get words, use `[\w-]+` — ctwheels, Oct 18 '17 at 14:11
Hi - have added what I am trying to achieve in last line of the post. — brucezepplin, Oct 18 '17 at 14:19
@brucezepplin what do you mean by `in order`? Also, what defines an abbreviation? Are `.` permitted? What character defines a hyphenated word (I assume `-`, but see this Wikipedia article for more options: [The hyphen in other languages](https://en.wikipedia.org/wiki/Hyphen#The_hyphen_in_other_languages)), what currency/currencies (there are so many currency symbols). If you're looking for multiple you may want to look at this post [Python regex matching Unicode properties](https://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties) since python doesn't allow... — ctwheels, Oct 18 '17 at 14:22
sorry - in the way the regex is written ie. the first part `(?:(?:[A-Z])(?:.[A-Z])+.?)` should be abbreviations — brucezepplin, Oct 18 '17 at 14:23
... Unicode groups by default (such as `\p{Sc}`, which is the easiest way to grab any unicode currency symbol). Also, what do you mean by ellipsis and ad-hoc tokens **etc.** do you want to grab all visible symbols? Again, unicode would be super helpful here with `\p{S}\p{P}` (symbols, punctuation) — ctwheels, Oct 18 '17 at 14:29
So if I understand properly you want specificity instead of the order in terms of each section of the regex? So abbreviations are *considered* more important than hyphenated words, which, in turn, are *considered* more important than currency and percentages, etc. — ctwheels, Oct 18 '17 at 14:30
@ctwheels thanks for the suggestion. already this has been helpful because I think the error is down to malformed regex. I will give the unicode a go. — brucezepplin, Oct 18 '17 at 14:30
@ctwheels, I don't want to embed importance; these are merely the tokens I would like to extract, i.e. none is more important than any other. — brucezepplin, Oct 18 '17 at 14:31
You have to embed importance in some respect since regex will match in order of the options so, for example `.|[a-z]` is malformed regex because the first option `.` will **always** match any character (`a-z` included), therefore, in this case, `[a-z]` will never be matched. A better solution would be to use `[a-z]|.` since specificity matters (obviously this is a terrible example, but it does show you the difference in order of elements) — ctwheels, Oct 18 '17 at 14:34
@ctwheels that's not "malformed", since it's legal; it's just useless. Malformed regexes trigger an error. — alexis, Oct 18 '17 at 17:28
@brucezepplin the error in your regex is caused by the ? in ((?:$?\d+(?:.\d+)?%?)): I get "the preceding token is not quantifiable" when I try it in an online tool. The solution is to escape it with a backslash, so that sub-expression should be ((?:$\?\d+(?:.\d+)?%?)) instead of ((?:$?\d+(?:.\d+)?%?)). That removes the error. I am not sure about the output, as I don't know the desired output from the question. — utengr, Oct 19 '17 at 09:55
Your script works fine if you make that change; I tested it. However, I am not sure about the desired result. It looks like you are trying to match a literal ? there, which might not be what you want to do. As it's a quantifier, you need to escape it to match it. — utengr, Oct 19 '17 at 09:56
https://regex101.com/r/jH2dN5/1 I posted an example for you here. Just play with it to see whether your regex works or not. — utengr, Oct 19 '17 at 10:13

score 0 · Answer 1 · answered Oct 20 '17 at 07:02

The error you get is related to the fact that you quantified a $ end-of-string anchor. An unescaped $ is a zero-width assertion that matches at the end of the string, so there is nothing for the ? quantifier to repeat. To match a literal $, you need to escape it.
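A minimal sketch of the difference, using only the standard `re` module. The helper name `compiles` is made up for this illustration.

```python
import re

def compiles(pattern):
    """Return True if `pattern` compiles, False if re rejects it (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error as exc:
        # For the quantified anchor the message contains "nothing to repeat"
        print(f"{pattern!r} failed: {exc}")
        return False

compiles(r"$?\d+")    # unescaped $ is an anchor, so ? has nothing to repeat
compiles(r"\$?\d+")   # escaped $ is a literal character and may be optional
print(re.findall(r"\$?\d+", "$5 and 7"))
```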

The . chars in your expression also need to get escaped to match literal dots.
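As a small illustration (the sample strings are made up), the abbreviation sub-pattern with unescaped dots accepts any character between the capitals, while the escaped version only accepts literal dots:

```python
import re

# Abbreviation sub-pattern from the question: "." matches any character here
unescaped = re.compile(r"[A-Z](?:.[A-Z])+.?")
# Same sub-pattern with the dots escaped so they only match a literal "."
escaped = re.compile(r"[A-Z](?:\.[A-Z])+\.?")

print(bool(unescaped.fullmatch("AxBy")))    # the unescaped dots accept "x" and "y"
print(bool(escaped.fullmatch("AxBy")))      # no literal dots, so no match
print(bool(escaped.fullmatch("U.S.A.")))    # a real abbreviation still matches
```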

However, there is also a problem with - forming a range in the character class in [][.,;"'?():-_`]. To make sure - matches a -, put it at the end, before the last ].
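To see what the range does, here is a short sketch on a made-up sample string. Inside a class, `:-_` covers every code point from `:` (0x3A) to `_` (0x5F), which includes the uppercase letters but not the hyphen (0x2D):

```python
import re

sample = "A-Z_:"

# ":-_" is read as a range from ":" to "_", so it matches "A" and "Z" but not "-"
as_range = re.findall(r"[:-_]", sample)
# With "-" placed last it is a literal hyphen, and the letters no longer match
as_literal = re.findall(r"[:_-]", sample)

print(as_range)
print(as_literal)
```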

Besides, it seems you want to match words that do not contain underscores (as you placed the _ in the last character class). Thus, I suggest subtracting _ from the \w pattern, replacing \w+(?:-\w+)* with [^\W_]+(?:-[^\W_]+)*.
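A quick comparison of the two word sub-patterns on a made-up sample string. `[^\W_]` means "a word character that is not an underscore":

```python
import re

sample = "snake_case well-known"

# \w includes the underscore, so "snake_case" stays one token
with_underscore = re.findall(r"\w+(?:-\w+)*", sample)
# [^\W_] excludes the underscore, so the word is split around it
without_underscore = re.findall(r"[^\W_]+(?:-[^\W_]+)*", sample)

print(with_underscore)
print(without_underscore)
```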

Here is a pattern with my suggestions implemented:

sentence_re = r'''\$?\d+(?:\.\d+)?%?|[A-Z](?:\.[A-Z])+\.?|[^\W_]+(?:-[^\W_]+)*|(?:\.{3}|)[][.,;"'?():_`-]'''

See the regex demo
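A rough way to try the pattern without NLTK is plain `re.findall`. This assumes `nltk.regexp_tokenize` applies a pattern without capturing groups in much the same way; the `tokenize` helper and the sample sentence are made up for this sketch.

```python
import re

# Pattern copied from the answer above
sentence_re = r'''\$?\d+(?:\.\d+)?%?|[A-Z](?:\.[A-Z])+\.?|[^\W_]+(?:-[^\W_]+)*|(?:\.{3}|)[][.,;"'?():_`-]'''

def tokenize(text):
    """Return all non-overlapping matches of sentence_re (hypothetical helper)."""
    return re.findall(sentence_re, text)

print(tokenize("U.S.A. prices rose 3.5% to $12.40, a well-known fact."))
print(tokenize("Buddha...which"))
```

With the pattern as written, the ellipsis in `Buddha...which` comes out as three separate `.` tokens, because `(?:\.{3}|)` must be followed by one more punctuation character. If a single `...` token is wanted, one option is to make `\.{3}` an alternative of its own placed before the punctuation class.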
