I want to use NLTK for NLP tasks on a Hadoop cluster through PySpark. We use an Anaconda distribution.
The cluster is in an air-gapped environment, so I cannot run nltk.download().
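For reference, this is the standard online approach, which fails on our cluster (the 'punkt' package is just an example):

```python
import nltk

# Requires internet access -- fails with a connection error in our environment
nltk.download('punkt')  # Punkt sentence tokenizer models
```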
I'm thinking I need to download the data on a secondary machine with internet access. Where do I download it from? And how do I install it on the Hadoop cluster? Do I just copy the files, or does NLTK need to know where the data is? Does the data need to be copied to all nodes?
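If copying the files is the right approach, I imagine pointing NLTK at the copied directory on each executor with something like the sketch below. The path /opt/nltk_data is hypothetical, and I'm not sure whether appending to nltk.data.path inside the task is the correct mechanism:

```python
import nltk
from nltk.tokenize import word_tokenize

# Hypothetical directory where the NLTK data would be copied on every node
NLTK_DATA_DIR = "/opt/nltk_data"

def tokenize_partition(lines):
    # Tell NLTK to look in the local copy instead of trying to download
    if NLTK_DATA_DIR not in nltk.data.path:
        nltk.data.path.append(NLTK_DATA_DIR)
    for line in lines:
        yield word_tokenize(line)

# Intended usage in the Spark job, e.g.:
# tokens_rdd = text_rdd.mapPartitions(tokenize_partition)
```

Is this how it's supposed to work, or is there a better way to distribute the data (e.g. via spark-submit) so each executor can find it?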