
I'm using the following regex; it's supposed to find the string 'U.S.A.', but it only gets 'A.'. Does anyone know what's wrong?

#INPUT
import re

text = 'That U.S.A. poster-print costs $12.40...'

print(re.findall(r'([A-Z]\.)+', text))

#OUTPUT
['A.']

Expected Output:

['U.S.A.']

I'm following the NLTK Book, Chapter 3.7. It has a set of regexes, but they're just not working. I've tried it in both Python 2.7 and 3.4.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

nltk.regexp_tokenize() works the same way as re.findall(), so I think somehow my Python does not recognize the regex as expected. The regex listed above outputs this:

[('', '', ''),
 ('A.', '', ''),
 ('', '-print', ''),
 ('', '', ''),
 ('', '', '.40'),
 ('', '', '')]
LingxB
  • Since you didn't mention a pattern: if your **only** motive is to find `U.S.A.`, then `(U.S.A.)` will suffice. –  Jan 31 '16 at 20:00
  • See https://github.com/nltk/nltk/issues/1206 and http://stackoverflow.com/questions/32300437/python-parsing-user-input-using-a-verbose-regex and http://stackoverflow.com/questions/22175923/nltk-regexp-tokenizer-not-playing-nice-with-decimal-point-in-regex – alvas Jan 31 '16 at 20:15

4 Answers


Possibly it has something to do with how regexes were previously compiled using nltk.internals.compile_regexp_to_noncapturing(), which was abolished in v3.1 (see the NLTK issue linked in the comments on the question).

>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               | [+/\-@&*]        # special characters with meanings
...             '''
>>> 
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

But it doesn't work in NLTK v3.1:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               | [+/\-@&*]        # special characters with meanings
...             '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]

With a slight modification of how you define your regex groups, you could get the same pattern to work in NLTK v3.1, using this regex:

pattern = r"""(?x)                   # set flag to allow verbose regexps
              (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
              |\d+(?:\.\d+)?%?       # numbers, incl. currency and percentages
              |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
              |(?:[+/\-@&*])         # special characters with meanings
            """

In code:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x)                   # set flag to allow verbose regexps
... (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%?       # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*])         # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

Without NLTK, using Python's re module, we can see that the old regex patterns are not supported natively:

>>> pattern1 = r"""(?x)               # set flag to allow verbose regexps
...               ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
...               |\$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
...               |\w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
...               |[+/\-@&*]        # special characters with meanings
...               |\S\w*            # any sequence of word characters
... """            
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x)                   # set flag to allow verbose regexps
...                       (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
...                       |\d+(?:\.\d+)?%?       # numbers, incl. currency and percentages
...                       |\w+(?:[-']\w+)*       # words w/ optional internal hyphens/apostrophe
...                       |(?:[+/\-@&*])         # special characters with meanings
...                     """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

Note: This change in how NLTK's RegexpTokenizer compiles regexes also makes the examples in NLTK's Regular Expression Tokenizer documentation obsolete.
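
As a quick sanity check (a hedged sketch, not an NLTK API), you can count the capturing groups in a pattern before handing it to the tokenizer; anything above zero means re.findall(), and hence RegexpTokenizer in v3.1+, will return group captures instead of whole tokens:

import re

def has_capturing_groups(pattern, flags=0):
    # re.compile(...).groups is the number of capturing groups in the pattern
    return re.compile(pattern, flags).groups > 0

old_pattern = r'(?x) ([A-Z]\.)+ | \w+(-\w+)* | \$?\d+(\.\d+)?%?'
new_pattern = r'(?x) (?:[A-Z]\.)+ | \w+(?:-\w+)* | \$?\d+(?:\.\d+)?%?'
print(has_capturing_groups(old_pattern))  # True  -> rewrite groups as (?:...)
print(has_capturing_groups(new_pattern))  # False -> safe for findall/tokenize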

alvas

Drop the trailing +, or put it inside the group:

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> re.findall(r'([A-Z]\.)+', text)
['A.']              # wrong
>>> re.findall(r'([A-Z]\.)', text)
['U.', 'S.', 'A.']  # without '+'
>>> re.findall(r'((?:[A-Z]\.)+)', text)
['U.S.A.']          # with '+' inside the group
Andrea Corbellini

The first part of the text that the regexp matches is "U.S.A.", because ([A-Z]\.)+ matches the first group (the part within parentheses) three times. However, only one match per group can be returned, so Python picks the last match for that group.

If you instead change the regular expression so that the "+" is inside the group, then the group will only match once and the full match will be returned, for example (([A-Z]\.)+) or ((?:[A-Z]\.)+).

If you instead want three separate results, just get rid of the "+" sign in the regular expression, and it will match one letter and one dot each time.

Jonas Berlin

The problem is the "capturing group", a.k.a. the parentheses, which has an unexpected effect on the result of findall(): when a capturing group is used multiple times in a match, the regexp engine loses track and strange things happen. Specifically, the regexp correctly matches the entire U.S.A., but findall drops it on the floor and only returns the last group capture.
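
Here is a minimal sketch (plain re, using the same sentence from the question) that shows exactly that: the engine matches all of U.S.A., but only the last repetition of the capturing group is kept:

import re

text = 'That U.S.A. poster-print costs $12.40...'
m = re.search(r'([A-Z]\.)+', text)
print(m.group(0))  # 'U.S.A.' -> the full match spans the whole abbreviation
print(m.group(1))  # 'A.'     -> the group only remembers its last repetition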

As this answer says, the re module doesn't support repeated capturing groups, but you could install the alternative regex module, which does handle this correctly. (However, this would be no help to you if you want to pass your regexp to nltk.tokenize.regexp.)
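
For completeness, a hedged sketch of what the third-party regex package (pip install regex) gives you here, via its Match.captures() method, assuming it is installed:

import regex

text = 'That U.S.A. poster-print costs $12.40...'
m = regex.search(r'([A-Z]\.)+', text)
print(m.group(0))     # 'U.S.A.'            -> the whole match, as with re
print(m.captures(1))  # ['U.', 'S.', 'A.']  -> every repetition of group 1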

Anyway, to match U.S.A. correctly, use this: r'(?:[A-Z]\.)+'.

>>> re.findall(r'(?:[A-Z]\.)+', text)
['U.S.A.']

You can apply the same fix to all repeated patterns in the NLTK regexp, and everything will work correctly. As @alvas suggested, NLTK used to make this substitution behind the scenes, but this feature was recently dropped and replaced with a warning in the documentation of the tokenizer. The book is clearly out of date; @alvas filed a bug report about it back in November, but it hasn't been acted on yet...
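
For example, here is a sketch of the book's pattern from the question with every group rewritten as non-capturing (assuming NLTK >= 3.1), which restores the expected output:

import nltk

pattern = r'''(?x)        # set flag to allow verbose regexps
    (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*          # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                # ellipsis
  | [][.,;"'?():-_`]      # these are separate tokens; includes ], [
'''

text = 'That U.S.A. poster-print costs $12.40...'
print(nltk.regexp_tokenize(text, pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']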

alexis