75

Is there a ready-to-use English grammar that I can just load and use in NLTK? I've searched around for examples of parsing with NLTK, but it seems that I have to manually specify a grammar before parsing a sentence.

Thanks a lot!

Fred Foo
roboren

8 Answers

33

You can take a look at pyStatParser, a simple Python statistical parser that returns NLTK parse trees. It comes with public treebanks and generates the grammar model only the first time you instantiate a Parser object (in about 8 seconds). It uses a CKY algorithm and parses average-length sentences (like the one below) in under a second.

>>> from stat_parser import Parser
>>> parser = Parser()
>>> print parser.parse("How can the net amount of entropy of the universe be massively decreased?")
(SBARQ
  (WHADVP (WRB how))
  (SQ
    (MD can)
    (NP
      (NP (DT the) (JJ net) (NN amount))
      (PP
        (IN of)
        (NP
          (NP (NNS entropy))
          (PP (IN of) (NP (DT the) (NN universe))))))
    (VP (VB be) (ADJP (RB massively) (VBN decreased))))
  (. ?))
emilmont
  • For Python 3 users, there's a pull request to add Python 3 support here: https://github.com/emilmont/pyStatParser/pull/7 I only found out about that pull request after using the `2to3` tool to "manually" convert all the files from Python 2 to Python 3. – VinceFior Apr 15 '16 at 01:33
  • To build the grammar model and run an example: `python example.py` with the default text hardcoded. Very easy to use and embeddable. – loretoparisi Nov 21 '16 at 17:17
  • I've issued these commands to be able to use pyStatParser: `2to3 --output-dir=stat_parser3 -W -n stat_parser` `rm stat_parser` `mv stat_parser3 stat_parser` `setup.py build` `setup.py install` and it worked, thanks @emilmont – Mancy Feb 14 '17 at 21:13
  • The library would parse "The Sun rises from the East" as - ``(SINV (NP (NP (DT the) (NNP Sun) (NNP rises)) (PP (IN from) (NP (DT the) (NNP East)))) (. .)) `` Shouldn't "rises" be a ``VP``? How do we avoid interpreting "rises" as a proper noun? – argmin Jul 09 '18 at 10:50
26

My library, spaCy, provides a high-performance dependency parser.

Installation:

pip install spacy
python -m spacy.en.download all

Usage:

from spacy.en import English
nlp = English()
doc = nlp(u'A whole document.\nNo preprocessing required.   Robust to arbitrary formatting.')
for sent in doc:
    for token in sent:
        if token.is_alpha:
            print token.orth_, token.tag_, token.head.lemma_

Choi et al. (2015) found spaCy to be the fastest dependency parser available. It processes over 13,000 sentences a second, on a single thread. On the standard WSJ evaluation it scores 92.7%, over 1% more accurate than any of CoreNLP's models.

syllogism_
  • thank you for this, I'm excited to check out spaCy. Is there a way to selectively import only the minimal amount of data necessary to parse your example sentence? Whenever I run `spacy.en.download all` it initiates a download that appears to be over 600 MB! – wil3 Jan 03 '16 at 23:57
  • In addition, my empty 1GB RAM vagrant box doesn't seem to be able to handle the memory required by spaCy and faults with a MemoryError. I'm assuming it's loading the whole dataset into memory? – Xeoncross Feb 04 '16 at 23:12
  • You can't only load the data necessary to parse one sentence, no — the assumed usage is that you'll parse arbitrary text. It does require 2-3gb of memory per process. We expect the memory requirements to go down when we finish switching over to a neural network. In the meantime, we've added multi-threading support, so that you can amortise the memory requirement across multiple CPUs. – syllogism_ Feb 07 '16 at 01:36
  • Note that the correct usage is now `for sent in doc.sents:` – Phylliida Sep 02 '16 at 20:16
  • @JamesKo API changed, use: `import spacy`, then `nlp = spacy.load('en')` , and then process your sentences as: `doc = nlp(u'Your unprocessed document here`) – Carlo Mazzaferro Apr 18 '18 at 17:50
  • It is now `python -m spacy download en` – Sam Redway Nov 18 '18 at 18:10
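
Putting the comments above together, here is a hedged sketch against the newer spaCy API (the `en` model name and the iteration details follow the comments and may be version-dependent):

import spacy

# Newer API, per the comments above: download a model first with
#   python -m spacy download en
nlp = spacy.load('en')
doc = nlp(u'A whole document.\nNo preprocessing required. Robust to arbitrary formatting.')
for sent in doc.sents:  # sentence iteration moved to doc.sents
    for token in sent:
        if token.is_alpha:
            print(token.orth_, token.tag_, token.head.lemma_)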
6

There is a library called Pattern. It is quite fast and easy to use.

>>> from pattern.en import parse
>>>  
>>> s = 'The mobile web is more important than mobile apps.'
>>> s = parse(s, relations=True, lemmata=True)
>>> print s

'The/DT/B-NP/O/NP-SBJ-1/the mobile/JJ/I-NP/O/NP-SBJ-1/mobile' ... 
user3798928
6

There are a few grammars in the nltk_data distribution. In your Python interpreter, issue nltk.download().
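
For example (a minimal sketch; the `large_grammars` package name and the ATIS grammar path are assumptions, and only sentences the grammar covers will parse):

import nltk

# Fetch the sample grammars shipped with nltk_data, then load one of them.
nltk.download('large_grammars')
grammar = nltk.data.load('grammars/large_grammars/atis.cfg')
parser = nltk.ChartParser(grammar)

# Parse a sentence from the grammar's own domain (air travel queries).
for tree in parser.parse('show me northwest flights to detroit .'.split()):
    print(tree)
    break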

Fred Foo
  • Yes, but it's not sufficient for an arbitrary sentence. When I try some random sentence, it shows "Grammar does not cover some of the input words: ...." Am I doing it wrong? I want to get a parse tree of a sentence. Is this the right way to do it? Thanks! – roboren May 24 '11 at 21:33
  • @roboren: you could take the Penn treebank portion in `nltk_data` and derive a CFG from it by simply turning tree fragments (a node and its direct subnodes) into rules (see the sketch after these comments). But you probably won't find a "real" grammar unless you look into statistical parsing; no-one builds non-stochastic grammars anymore since they just don't work, except for very domain-specific applications. – Fred Foo May 24 '11 at 21:36
  • Does nltk provide statistical parsing? Otherwise, I may want to switch to Stanford parser. Once again, thank you very much =) – roboren May 24 '11 at 22:21
  • Yes: http://nltk.googlecode.com/svn-history/r7492/trunk/doc/api/nltk.parse.ViterbiParse-class.html. Not sure if you have to derive the grammar for this yourself, though. – Fred Foo May 25 '11 at 06:11
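
A minimal sketch of the treebank-to-grammar idea from the comments above, using `induce_pcfg` so the result can drive the `ViterbiParser` just linked (the exact preprocessing is an assumption, and coverage is limited to the sample's vocabulary):

import nltk
from nltk.corpus import treebank

nltk.download('treebank')  # the Penn treebank sample bundled with nltk_data

# Turn every tree fragment (a node and its direct subnodes) into a production.
productions = []
for tree in treebank.parsed_sents():
    productions += tree.productions()

grammar = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)
parser = nltk.ViterbiParser(grammar)

# Reuse one of the sample's own sentences so every word is covered
# (this can take a while; Viterbi parsing is cubic in sentence length).
for tree in parser.parse(treebank.sents()[0]):
    print(tree)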
5

Use MaltParser. It comes with a pretrained English grammar, and there are pretrained models for some other languages as well. MaltParser is a dependency parser, not some simple bottom-up or top-down parser.

Just download MaltParser from http://www.maltparser.org/index.html and use it from NLTK like this:

import nltk
parser = nltk.parse.malt.MaltParser()
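
For a rough idea of usage (hedged: the constructor arguments and the model file name vary across NLTK versions, so treat the paths below as assumptions):

import nltk

# Newer NLTK versions want explicit paths to the MaltParser installation and
# a pretrained model such as engmalt.linear-1.7.mco (both assumed here).
parser = nltk.parse.malt.MaltParser('maltparser-1.7.2', 'engmalt.linear-1.7.mco')
graph = parser.parse_one('I saw a bird from my window .'.split())
print(graph.tree())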
blackmamba
  • MaltParser looks good, but I wasn't able to get it working with nltk (it kept failing with the message "Couldn't find the MaltParser configuration file: malt_temp.mco"). The MaltParser itself I got working fine. – Nathaniel Waisbrot Aug 27 '12 at 04:36
4

I've tried NLTK, PyStatParser, and Pattern. IMHO Pattern is the best English parser introduced in the article above, because it supports pip install and there is a fancy document on the website (http://www.clips.ua.ac.be/pages/pattern-en). I couldn't find reasonable documentation for NLTK (and by default it gave me inaccurate results, and I couldn't find how to tune it). pyStatParser is much slower than described above in my environment (about one minute for initialization, and a couple of seconds to parse long sentences; maybe I didn't use it correctly).

Piyo Hoge
  • Pattern doesn't seem to be doing parsing (as in, [dependency parsing](http://en.wikipedia.org/wiki/Dependency_grammar)), only POS-tagging and maybe chunking. It's fairly normal for parsers to take a while on long sentences. – Nikana Reklawyks Apr 19 '15 at 18:33
  • @NikanaReklawyks exactly, the right `nltk` tool here is like `PyStatParser` that builds a grammar that is a `PCFG` grammar, i.e. Probabilistic Context-Free Grammars - http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/pcfgs.pdf – loretoparisi Nov 21 '16 at 17:22
4

Did you try POS tagging in NLTK?

import nltk
from nltk import word_tokenize

# Requires the tokenizer and tagger models:
#   nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)

The output is something like this:

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

Got this example from NLTK_chapter03.

1

I found out that NLTK works well with the parser grammar developed by Stanford.

Syntax Parsing with Stanford CoreNLP and NLTK

It is very easy to start using Stanford CoreNLP and NLTK. All you need is a little preparation, after which you can parse sentences with the following code:

from nltk.parse.corenlp import CoreNLPParser
parser = CoreNLPParser()
parse = next(parser.raw_parse("I put the book in the box on the table."))

Preparation:

  1. Download the Stanford CoreNLP Java models
  2. Run CoreNLPServer

You can use the following code to run CoreNLPServer:

import os
from nltk.parse.corenlp import CoreNLPServer
# The server needs to know the location of the following files:
#   - stanford-corenlp-X.X.X.jar
#   - stanford-corenlp-X.X.X-models.jar
STANFORD = os.path.join("models", "stanford-corenlp-full-2018-02-27")
# Create the server
server = CoreNLPServer(
   os.path.join(STANFORD, "stanford-corenlp-3.9.1.jar"),
   os.path.join(STANFORD, "stanford-corenlp-3.9.1-models.jar"),    
)
# Start the server in the background
server.start()

Do not forget to stop the server by executing server.stop().
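
For example (a minimal sketch using only the pieces shown above), you can wrap the parsing step so the background server is always shut down:

try:
    parser = CoreNLPParser()
    tree = next(parser.raw_parse("I put the book in the box on the table."))
    tree.pretty_print()  # draw the constituency tree as ASCII art
finally:
    # Shut down the background CoreNLP server
    server.stop()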