
I was following the first chapter of the NLTK book. It asks us to install the book corpus by running `nltk.download()`.

I am getting a `getaddrinfo failed` error when running `nltk.download()`. After reading online, I learned that this has something to do with my proxy.


So I tried setting proxies in different ways (trying http or https, %40 or @ in password):

nltk.set_proxy('http://proxy.mycompany.com:8080',('123456','password%40123'))
nltk.set_proxy('http://proxy.mycompany.com:8080',('123456','password@123'))
nltk.set_proxy('https://proxy.mycompany.com:8080',('123456','password%40123'))
nltk.set_proxy('https://proxy.mycompany.com:8080',('123456','password@123'))

(I was able to successfully set the proxy for pip and install nltk, but I am not sure whether I am making a mistake with the nltk proxy.)
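For comparison, here is a minimal sketch of the same proxy setup done with the standard library only. It assumes the NLTK downloader fetches files through `urllib.request.urlopen`, so an opener installed in the same Python process would apply to it; that assumption is worth verifying against the installed NLTK version. The helper names `build_proxy_url` and `install_proxy` are hypothetical, and the host, port, and credentials are the placeholders from above. When credentials are embedded in the proxy URL, reserved characters such as `@` have to be percent-encoded.

```python
from urllib.parse import quote
from urllib.request import ProxyHandler, build_opener, install_opener


def build_proxy_url(host, port, user, password, scheme="http"):
    """Return a proxy URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" forces reserved characters such as "@" and ":" to be encoded,
    # so "password@123" becomes "password%40123".
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    return f"{scheme}://{user_enc}:{password_enc}@{host}:{port}"


def install_proxy(proxy_url):
    """Route http and https urllib requests through proxy_url (hypothetical helper)."""
    opener = build_opener(ProxyHandler({"http": proxy_url, "https": proxy_url}))
    # Affects later urllib.request.urlopen calls made in this process only.
    install_opener(opener)
    return opener


# Placeholder values taken from the question; replace with real ones.
proxy_url = build_proxy_url("proxy.mycompany.com", 8080, "123456", "password@123")
install_proxy(proxy_url)
```

Because the opener is installed per process, this would have to run in the same interpreter session as `nltk.download('book')`; it would not carry over to a separate `python -m nltk.downloader all` invocation.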

Then I also tried

C:\Users\123456>python -m nltk.downloader all
[nltk_data] Error loading all: <urlopen error [Errno 11004]
[nltk_data]     getaddrinfo failed>
Error installing package. Retry? [n/y/e]

Next I tried

>>>nltk.download('book') 

But this too gives same error:

>>> nltk.download('book')
[nltk_data] Error loading book: <urlopen error [Errno 11004]
[nltk_data]     getaddrinfo failed>

Then I also tried changing the server index URL as suggested here, but to no avail. The pre-populated index URL is also alive (I can open it in the browser), so I guess I do not need to change the server index URL.
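One way to narrow this down is to check name resolution directly. `getaddrinfo failed` is raised when a host name cannot be resolved, which behind a corporate proxy often suggests that Python is trying to reach the remote host directly instead of going through the proxy, or that the proxy host name itself does not resolve. The sketch below uses only the standard library; `can_resolve` and `report` are hypothetical helper names.

```python
import socket


def can_resolve(host, port=443):
    """Return True if this machine can resolve host on its own (hypothetical helper)."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Same error family as "[Errno 11004] getaddrinfo failed" on Windows.
        return False


def report(hosts):
    """Map each host name to whether it resolves from this machine."""
    return {host: can_resolve(host) for host in hosts}
```

Calling `report(["proxy.mycompany.com", "raw.githubusercontent.com"])` would show whether the proxy host and the index host resolve. If only the proxy host resolves, direct downloads cannot work and the proxy settings are probably not being applied to the request.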

Mahesha999
  • The precise URL you are attempting to access and the diagnostics to suggest that the problem is proxy-related would be good to include in the question. It's not impossible that the problem is a combination of two things, or that the proxy is *also* a problem but that you have another problem which we cannot see from the details you currently provide. – tripleee Mar 06 '18 at 12:58
  • Also, where exactly are you entering this proxy configuration, and how does it apply to the `python -m` scenario? – tripleee Mar 06 '18 at 13:00
  • I am executing `nltk.download()` command on python command prompt. Also it seems that the call to `download()` tries to access [this url](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml), which is already included in the problem. `nltk.set_proxy()` is also executed on python command prompt. It has to be executed on python command prompt only, right? – Mahesha999 Mar 06 '18 at 13:19
  • Is there a firewall blocking your machine from downloading any files from the internet (not just nltk)? Run `wget https://norvig.com/big.txt` in the terminal. Did it download successfully? – alvas Mar 07 '18 at 02:00
  • @alvas I am able to open that link in browser. BTW I am on Windows 7, not on any *nix machine – Mahesha999 Mar 07 '18 at 05:37
  • You have to download the file through the command prompt to check whether something is blocking you. Also, see https://gist.github.com/alvations/0ed8641d7d2e1941b9f9 – alvas Mar 07 '18 at 06:13
  • After you've installed as the github gist instruction, in powershell python, `import nltk; nltk.download('popular')` – alvas Mar 07 '18 at 06:14
  • I don't have admin privileges on this machine. So I guess installing anaconda is possible. It seems that the instructions [here](https://medium.com/@satorulogic/how-to-manually-download-a-nltk-corpus-f01569861da9) are working. I tried downloading several corpora (inaugural, gutenberg etc.) from [here](http://www.nltk.org/nltk_data/), putting them in the folder `D:\nltk_data\corpora`, and then using them. But the same is not working with the `punkt` corpus. It seems that unlike the other corpora, which contain text, `punkt` contains pickle files. How do I import it? – Mahesha999 Mar 13 '18 at 10:27
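
On the manual route from the last comment: in the NLTK data index, `punkt` is listed under `tokenizers` rather than `corpora`, so it would be looked up in `D:\nltk_data\tokenizers\punkt`. Below is a minimal sketch of unpacking a manually downloaded zip into the matching category folder. `install_resource` is a hypothetical helper, and the sketch assumes the archive contains a single top-level folder named after the resource (for example `punkt/`).

```python
import zipfile
from pathlib import Path


def install_resource(zip_path, data_root, category):
    """Unpack a manually downloaded NLTK zip into <data_root>/<category>/.

    Hypothetical helper; assumes the zip holds one top-level folder
    named after the resource, e.g. "punkt/".
    """
    target = Path(data_root) / category
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, `install_resource("punkt.zip", r"D:\nltk_data", "tokenizers")` followed by `nltk.data.path.append(r"D:\nltk_data")` should let `nltk.data.load("tokenizers/punkt/english.pickle")` find the tokenizer in NLTK releases of this period. The pickle files are loaded by NLTK itself and are not meant to be opened as text.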

0 Answers