
I am working through the NLTK tutorial.

The problem is that the tutorial requires downloading various corpora. After every solution I tried failed to get the corpora downloaded with `nltk.download()`, I resorted to the steps stated here.

I started downloading the corpora required for each example from this page and putting them in the directory D:\nltk_data\corpora. I was able to try out various examples, but then one example gave this error:

 Resource punkt not found.
 Please use the NLTK Downloader to obtain the resource:

 >>> import nltk
 >>> nltk.download('punkt')

So I downloaded punkt from the same page and copied it into the same directory, but it did not work. I also tried `from nltk.corpus import punkt`, as with the other corpora, but to no avail: it says "Unresolved import: punkt".

One difference between punkt and the other corpora is that it contains pickle files instead of text files. How should I fix this?

Code:

import nltk

from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid)) 
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

Error:

Traceback (most recent call last):
  File "D:\Mahesh\workspaces\pyworkspace\nltkdemo\chp2\chp2.py", line 8, in <module>
    num_sents = len(gutenberg.sents(fileid))
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\corpus\reader\util.py", line 233, in __len__
    for tok in self.iterate_from(self._toknum[-1]): pass
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\corpus\reader\util.py", line 296, in iterate_from
    tokens = self.read_block(self._stream)
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\corpus\reader\plaintext.py", line 129, in _read_sent_block
    for sent in self._sent_tokenizer.tokenize(para)])
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\data.py", line 984, in __getattr__
    self.__load()
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\data.py", line 976, in __load
    resource = load(self._path)
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\data.py", line 836, in load
    opened_resource = _open(resource_url)
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\data.py", line 954, in _open
    return find(path_, path + ['']).open()
  File "D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\lib\nltk\data.py", line 675, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Searched in:
    - 'C:\\Users\\593932/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'D:\\Softwares\\python\\WinPython-64bit-3.4.4.4Qt5\\python-3.4.4.amd64\\nltk_data'
    - 'D:\\Softwares\\python\\WinPython-64bit-3.4.4.4Qt5\\python-3.4.4.amd64\\share\\nltk_data'
    - 'D:\\Softwares\\python\\WinPython-64bit-3.4.4.4Qt5\\python-3.4.4.amd64\\lib\\nltk_data'
    - 'C:\\Users\\Mahesha999\\AppData\\Roaming\\nltk_data'
    - ''
**********************************************************************

The error seems to happen at line 8: num_sents = len(gutenberg.sents(fileid))
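
The "Searched in" list above names the data roots that were tried. As a rough illustration of why a copy of punkt placed under `corpora` would not be picked up, here is a minimal sketch of a root-by-root lookup. It is a simplified stand-in and not NLTK's actual implementation. The helper name `locate_resource` is hypothetical, and the resource path `tokenizers/punkt` is an assumption taken from the comments below.

```python
import os
import tempfile


def locate_resource(roots, resource):
    """Return the first existing path for `resource` under any root, else None.

    Hypothetical helper: a simplified imitation of a root-by-root lookup,
    not NLTK's real code.
    """
    parts = resource.split("/")
    for root in roots:
        candidate = os.path.join(root, *parts)
        if os.path.exists(candidate):
            return candidate
    return None


def demo():
    # Build a throwaway data root that mirrors the layout from the question:
    # punkt copied next to the corpora instead of under tokenizers/.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "corpora", "gutenberg"))
        os.makedirs(os.path.join(root, "corpora", "punkt"))

        before = locate_resource([root], "tokenizers/punkt")

        # Move punkt to <root>/tokenizers/punkt and look again.
        os.makedirs(os.path.join(root, "tokenizers"))
        os.rename(os.path.join(root, "corpora", "punkt"),
                  os.path.join(root, "tokenizers", "punkt"))
        after = locate_resource([root], "tokenizers/punkt")

        # (not found before the move, found after the move)
        return before is None, after is not None


if __name__ == "__main__":
    print(demo())
```

Under these assumptions, the lookup only succeeds once the folder sits at `<root>/tokenizers/punkt` under one of the listed roots; which root is used does not matter.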

Mahesha999
  • Can you please update the question with the actual code that is giving the error? – oldmonk Mar 13 '18 at 10:54
  • The correct path for the punkt folder is "nltk_data/tokenizers/punkt". Try to put the folder in that path and try again. – Sumit S Chawla Mar 13 '18 at 10:59
  • @Sam tried, rest of corpora is at `D:\nltk_data\corpora\`, `punkt` is at `D:\nltk_data\corpora\tokenizers`, not working... – Mahesha999 Mar 13 '18 at 11:05
  • @Sam put the same in `D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\nltk_data` and `D:\Softwares\python\WinPython-64bit-3.4.4.4Qt5\python-3.4.4.amd64\nltk_data\tokenizers`, it started working!!! – Mahesha999 Mar 13 '18 at 11:06
  • @Mahesha999: not `D:\nltk_data\corpora\tokenizers` but the `nltk_data/tokenizers/` path. corpora and tokenizers are two different folders inside nltk_data. – Sumit S Chawla Mar 13 '18 at 11:07
  • yup done... thanks for correcting me. One more thing: how does this gutenberg work? If I delete `nltk_data\corpora\gutenberg`, it does not work; when it is kept, it works. But I was wondering how I am able to call `.raw()` and `.words()` on it. `print(gutenberg)` prints `<PlaintextCorpusReader in 'D:\\nltk_data\\corpora\\gutenberg'>`. So importing gutenberg imports `PlaintextCorpusReader`!! I am not able to understand this. How exactly is this import behaving? Because there is no `PlaintextCorpusReader` at the path `D:\\nltk_data\\corpora\\gutenberg`. – Mahesha999 Mar 13 '18 at 11:31
  • Please see https://stackoverflow.com/questions/22211525/how-do-i-download-nltk-data – alvas Mar 13 '18 at 14:44
  • Download with `nltk.download('punkt')` and don't move anything. The NLTK package will automatically find it. – alvas Mar 13 '18 at 14:48
  • Yes, I know that. I already tried that and [had lots of issues](https://stackoverflow.com/questions/49130879/setting-up-ntlk-proxy), as stated in the question. I will be happy if you resolve that too. – Mahesha999 Mar 13 '18 at 15:26
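
On the question in the comments about how `gutenberg` can expose `.raw()` and `.words()`: the traceback in the question passes through `__getattr__` and `__load` in `nltk\data.py`, which suggests a placeholder object that loads its real content on first use. NLTK's own mechanism for corpora is `LazyCorpusLoader` in `nltk.corpus.util`. The sketch below only illustrates the general lazy-proxy idea in plain Python and is not NLTK's code: `LazyProxy`, `FakeReader`, `make_reader` and the in-memory file dict are hypothetical names invented for this example.

```python
class FakeReader:
    """Hypothetical stand-in for a corpus reader over some files."""

    def __init__(self, files):
        self._files = files  # maps file id -> raw text

    def fileids(self):
        return sorted(self._files)

    def raw(self, fileid):
        return self._files[fileid]

    def words(self, fileid):
        return self._files[fileid].split()


class LazyProxy:
    """Builds the real object only when an attribute is first requested."""

    def __init__(self, factory):
        self._factory = factory
        self._target = None

    def __getattr__(self, name):
        # Called only for attributes not found normally, e.g. .raw or .words.
        if self._target is None:
            self._target = self._factory()
        return getattr(self._target, name)


load_count = []  # records each time the real reader is built


def make_reader():
    load_count.append(1)
    return FakeReader({"a.txt": "hello lazy world"})


corpus = LazyProxy(make_reader)  # nothing has been loaded at this point
words = corpus.words("a.txt")    # first attribute access triggers the load
```

With a proxy like this, the import itself does no work and the data directory is only needed at first use, which would be consistent with errors surfacing at a call such as `gutenberg.sents(fileid)` rather than at import time.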
