Python regex: test whether a sentence is valid

Question

ACTIVE_LIST = ACTOR | ACTIVE_LIST and ACTOR
ACTOR = NOUN | ARTICLE NOUN
ARTICLE = a | the
NOUN = tom | jerry | goofy | mickey | jimmy | dog | cat | mouse

By applying the above rules I can generate

a tom 
tom and a jerry 
the tom and a jerry 
the tom and a jerry and tom and dog

but not

Tom 
the Tom and me

Can I check whether a sentence is valid using only Python's re module? I know how to match certain characters with [abc], but I don't know how to match whole words. I am actually trying to solve this ACM problem. If someone can help me with part of it, I can do the rest. This is my first question here. Any suggestions or improvements are highly appreciated.

See the docs at https://docs.python.org/2/library/re.html. You can use re.IGNORECASE, and then [A-Z] will match both lower and upper case. — rfkortekaas, Dec 31 '15 at 08:29
This is not best described with regular expressions. It is a parsing problem and tools like [the pyparsing module at PyPI](https://pypi.python.org/pypi/pyparsing/2.0.7) would get you looking at the problem as the writers seem to think you ought. — msw, Dec 31 '15 at 23:37

score 2 · Answer 1 · edited May 23 '17 at 12:33


Use re.compile

re.compile('tom', re.IGNORECASE)

The following topic shows other ways to do this without re.compile, by calling re.search / re.match directly:

Case insensitive Python regular expression without re.compile
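As a minimal sketch of both options (not taken from the linked topic): the module-level functions re.match and re.search accept the same flags argument as re.compile. Note that the grammar in the question is lower-case only, so ignoring case would also accept "Tom", which the question lists as invalid.

```python
import re

# Option 1: compile once with the IGNORECASE flag, then reuse the pattern.
pattern = re.compile('tom', re.IGNORECASE)
print(bool(pattern.match('Tom')))                            # True

# Option 2: no re.compile, pass the flag straight to re.match / re.search.
print(bool(re.match('tom', 'Tom', flags=re.IGNORECASE)))     # True
print(bool(re.search('tom', 'a TOM', flags=re.IGNORECASE)))  # True

# Without the flag, matching is case-sensitive.
print(bool(re.match('tom', 'Tom')))                          # False
```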

answered Dec 31 '15 at 08:45 by mmoustai · edited May 23 '17 at 12:33 by Community

score 1 · Answer 2 · answered Dec 31 '15 at 16:17

This can be seen as an NLP (Natural Language Processing) problem. The Python package NLTK (Natural Language Toolkit) is well suited to this task and makes it easier than regular expressions do.

1) First you need to install NLTK (http://www.nltk.org/install.html)

2) Import NLTK:

import nltk

3) Create a small context-free grammar containing your four rules (https://en.wikipedia.org/wiki/Context-free_grammar). With NLTK's CFG class, you can do that in a single call:

acm_grammar = nltk.CFG.fromstring("""
ACTIVE_LIST -> ACTOR | ACTIVE_LIST 'and' ACTOR
ACTOR -> NOUN | ARTICLE NOUN
ARTICLE -> 'a' | 'the'
NOUN -> 'tom' | 'jerry' | 'goofy' | 'mickey' | 'jimmy' | 'dog' | 'cat' | 'mouse' """)

4) Create a parser that will use the acm_grammar:

parser = nltk.ChartParser(acm_grammar)

5) Test it on some input. Each input sentence must be passed to the parser as a list of words (strings). The split() method can be used for this:

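sentences = ["a tom", "tom and a jerry", "the tom and a jerry", "the tom and a jerry and tom and dog", "Tom", "the Tom and me"]

for sent in sentences:
    split_sent = sent.split()
    try:
        # parse() yields one tree per valid analysis; no trees means the
        # words are known but their order breaks the grammar.
        valid = any(True for _ in parser.parse(split_sent))
    except ValueError:
        # Raised when the sentence contains a word the grammar does not cover.
        valid = False
    print(sent, "-- YES I WILL" if valid else "-- NO I WON'T")

In this last step, we check whether the parser finds at least one parse tree for a sentence according to the acm_grammar. If the sentence contains a word that the grammar does not cover (such as "Tom" or "me"), the call to the parser raises a ValueError. If every word is known but the word order is invalid (for example "tom and"), no error is raised and the parser simply yields no trees, so the code checks for that case as well. Here is the output of this code: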

a tom -- YES I WILL
tom and a jerry -- YES I WILL
the tom and a jerry -- YES I WILL
the tom and a jerry and tom and dog -- YES I WILL
Tom -- NO I WON'T
the Tom and me -- NO I WON'T

I have up-voted your answer, as it is better than the previous one. Thanks for the link. The best answer will be accepted. — Saiful Azad, Dec 31 '15 at 16:27
That's kinda using a sledgehammer to swat flies. It does the job, but if the OP's (unstated) desire is understanding parsing, that is lost. — msw, Dec 31 '15 at 23:29

score 1 · Accepted Answer · answered Dec 31 '15 at 22:49

Yes, you can write that as a regex pattern, because the grammar is regular. The regular expression will be pretty long, but it can be generated in a fairly straightforward way; once you have the regex, you just compile it and apply it to each input.

The key is to turn regular rules into repetitions. For example,

STATEMENT = ACTION | STATEMENT , ACTION

can be turned into

ACTION (, ACTION)*

Of course, that's just a part of the problem, because you'd first have to have transformed ACTION into a regular expression in order to create the regex for STATEMENT.

The problem description glosses over an important issue, which is that the input does not just consist of lower-case alphabetic characters and commas. It also contains spaces, and the regular expression needs to insist on spaces at appropriate points. For example, the , above probably must (and certainly might) be followed by one (or more) spaces. It might be ok if it were preceded by one or more spaces, too; the problem description isn't clear.

So the correct regular expression for ACTOR (an optional ARTICLE followed by a NOUN) will actually turn out to be:

((a|the) +)?(tom|jerry|goofy|mickey|jimmy|dog|cat|mouse)
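Putting the pieces together for the four rules in the question, a minimal sketch could look like the following. It assumes words are separated by one or more spaces with no leading or trailing whitespace, which the problem description may or may not require; the names ACTIVE_LIST, active_list_re and is_active_list are only illustrative.

```python
import re

NOUN = r'(?:tom|jerry|goofy|mickey|jimmy|dog|cat|mouse)'
ARTICLE = r'(?:a|the)'

# ACTOR = NOUN | ARTICLE NOUN  ->  an optional article, then a noun
ACTOR = r'(?:' + ARTICLE + r' +)?' + NOUN

# ACTIVE_LIST = ACTOR | ACTIVE_LIST and ACTOR  ->  ACTOR ( and ACTOR)*
ACTIVE_LIST = ACTOR + r'(?: +and +' + ACTOR + r')*'

active_list_re = re.compile(ACTIVE_LIST)


def is_active_list(sentence):
    # fullmatch only succeeds if the whole string fits the pattern.
    return active_list_re.fullmatch(sentence) is not None


for s in ["a tom", "tom and a jerry", "Tom", "the Tom and me", "tom and"]:
    print(s, "->", is_active_list(s))
```

Only the first two sentences are accepted. No re.IGNORECASE flag is used, because the grammar contains lower-case words only and "Tom" must be rejected.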

(I also found it interesting that the grammar as presented lets VERB match "hatesssssssss". I have no idea whether that was intentional.)

score 0 · Answer 4 · answered Dec 31 '15 at 17:55

After thinking about it a lot, I solved it on my own:

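import re

ARTICLE = ('a', 'the')
NOUN = ('tom', 'jerry', 'goofy', 'mickey', 'jimmy', 'dog', 'cat', 'mouse')

# Every valid ACTOR: a bare noun, or an article followed by a noun.
all_a = NOUN + tuple(' '.join([x, y]) for x in ARTICLE for y in NOUN)


def aseKi(actor):
    # True if the string is a valid ACTOR.
    return actor in all_a

# Other inputs to try:
# st = 'the tom and jerry'
# st = 'tom and a jerry'
# st = 'tom and jerry and the mouse'
st = 'tom and goofy and goofy and the goofy and a dog and cat'

# Split on the whole word 'and'; a plain st.split('and') would also
# split inside a word such as 'tomand'.
val = re.split(r'\s+and\s+', st)

nice_val = [x.strip() for x in val]


s = [aseKi(x) for x in nice_val]

if all(s):
    print('YES I WILL')
else:
    print("NO I WON'T")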
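A possible variation on this approach, offered only as a sketch and not part of the original answer: work on whole words from split() instead of splitting the raw string, and wrap the check in a function so several sentences can be tested. The names ARTICLES, NOUNS, is_actor and is_active_list are hypothetical.

```python
ARTICLES = {'a', 'the'}
NOUNS = {'tom', 'jerry', 'goofy', 'mickey', 'jimmy', 'dog', 'cat', 'mouse'}


def is_actor(words):
    # ACTOR = NOUN | ARTICLE NOUN
    if len(words) == 1:
        return words[0] in NOUNS
    if len(words) == 2:
        return words[0] in ARTICLES and words[1] in NOUNS
    return False


def is_active_list(sentence):
    # Group the words into actors, using the word 'and' as the separator.
    actors, current = [], []
    for word in sentence.split():
        if word == 'and':
            actors.append(current)
            current = []
        else:
            current.append(word)
    actors.append(current)
    # An empty group (for example after a trailing 'and') is not an actor.
    return all(is_actor(actor) for actor in actors)


for s in ['a tom', 'tom and a jerry', 'Tom', 'the Tom and me']:
    print(s, '-- YES I WILL' if is_active_list(s) else "-- NO I WON'T")
```

Because the comparison is done word by word, an input such as 'tomand jerry' is rejected rather than being split in the middle of a word.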
