
I want to leverage NLTK for NLP tasks on a Hadoop cluster through PySpark. We use an Anaconda distribution. The cluster is in an air-gapped environment, so I cannot run nltk.download().

I'm thinking I need to download the data on a secondary machine with internet access. Where do I download it from? And how do I install it on the Hadoop cluster? Do I just copy the files, or does NLTK need to know where the data is? Does the data need to be copied to all nodes?

  • "Does the data need to be copied on all nodes?" Yes but what are you using the data for? Which dataset do you need? – alvas Jan 07 '17 at 16:22

1 Answer


Where do I download it from?

You can execute nltk.download() on a machine with internet access; the data gets downloaded into an nltk_data folder under your home directory.
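For an offline transfer it can be more convenient to download into an explicit directory that you then copy over. A minimal sketch, assuming you need the punkt and stopwords packages (substitute whichever corpora your tasks actually use):

```python
import nltk

# Download into an explicit folder you can later copy to the cluster.
# "punkt" and "stopwords" are only examples -- fetch the packages you need.
nltk.download("punkt", download_dir="/tmp/nltk_data")
nltk.download("stopwords", download_dir="/tmp/nltk_data")
```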

And how do I install it on the hadoop cluster? Do I just copy the files, or does NLTK need to know where the data is?

It should be sufficient to copy nltk_data into the home folder of the user that runs the processes on each machine. If that is not possible, you can use the NLTK_DATA environment variable to set the location. See How to config nltk data directory from code? for more discussion.
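A short sketch of both options; /opt/nltk_data is an assumed path, use wherever you actually copied the data:

```python
import os

# Option 1: environment variable. Must be set before NLTK looks up any data,
# so set it early (or in the shell / job configuration instead).
os.environ["NLTK_DATA"] = "/opt/nltk_data"  # assumed path

import nltk

# Option 2: extend NLTK's search path at runtime.
nltk.data.path.append("/opt/nltk_data")  # assumed path

# Sanity check: raises LookupError if the punkt models are not found.
nltk.data.find("tokenizers/punkt")
```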

Does the data need to be copied on all nodes?

Yes. Each PySpark executor loads the NLTK data from its local filesystem, so every worker node needs its own copy (or access to the same path on a shared filesystem).
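To illustrate, a minimal PySpark sketch that tokenizes text inside a UDF, assuming nltk_data sits at the same /opt/nltk_data path on every worker node:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("nltk-tokenize").getOrCreate()

def tokenize(text):
    # Runs on the executor: each worker reads its own local nltk_data copy.
    import nltk
    nltk.data.path.append("/opt/nltk_data")  # assumed path, same on every node
    return nltk.word_tokenize(text)

tokenize_udf = udf(tokenize, ArrayType(StringType()))

df = spark.createDataFrame([("NLTK runs on every executor.",)], ["text"])
df.withColumn("tokens", tokenize_udf("text")).show(truncate=False)
```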
