
I have a Spark Scala application, and I want to use a function that can only be written in Python because it relies on the NLTK package. My problem is how to provide the NLTK package to the project. Should I declare it in the dependencies, and if so, how?

I ask because when I write Python code that uses the nltk package inside the same project, I get an error saying the nltk package was not found.

I know that we can use pipe() to call a Python function from Scala Spark, but how would I add the nltk package to the same application?
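For concreteness, the pipe mechanism I have in mind looks roughly like the following sketch. The script path is hypothetical, and the Python script would have to be present on every worker node:

```scala
// sc: SparkContext (e.g. in spark-shell).
// pipe() writes each RDD element to the external process's stdin as one
// line, and every line the process prints to stdout becomes an element
// of the resulting RDD[String].
val sentences = sc.textFile("hdfs:///data/sentences.txt")
val collocations = sentences.pipe("python /opt/scripts/collocations.py")
collocations.collect().foreach(println)
```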

Any help is appreciated!

  • Why can't you use Stanford NLP? What exactly is the Python code you have? – OneCricketeer Oct 06 '16 at 08:05
  • Actually, I am trying to use the collocations module inside the nltk package (http://www.nltk.org/howto/collocations.html), which uses PMI. – Shivansh Oct 06 '16 at 08:10
  • And you can't just use PySpark instead? – OneCricketeer Oct 06 '16 at 08:12
  • No, my whole codebase is written in Scala for other things; this is something we want for one feature. I looked for it in Stanford NLP but didn't find one, hence we are going with NLTK and the pipe command. – Shivansh Oct 06 '16 at 08:14
  • This? http://stackoverflow.com/questions/10882488/what-is-the-best-way-to-use-python-code-from-scala-or-java or this http://stackoverflow.com/questions/32975636/how-to-use-both-scala-and-python-in-a-same-spark-project seem to be some solutions. You have to make sure all nodes in your cluster have the nltk module installed; then you just import it in your Python code as normal. – OneCricketeer Oct 06 '16 at 08:18
  • Actually, I have seen these two, but they do not talk about how to add a third-party Python package inside your Scala project, and that is what I am looking for, so that I can use it inside the project! – Shivansh Oct 06 '16 at 08:21
  • 2
    And I'm telling you that assuming you pipe something into the Python, you cannot "include" ntlk as part of your spark-scala application. The python module must reside on all Spark nodes. Then, you can "include" your python file, which will import that module – OneCricketeer Oct 06 '16 at 08:24
  • @cricket_007: Can you point me to an example repository? It would be very helpful! – Shivansh Nov 28 '16 at 12:31
  • 1
    I don't have any examples because this is an issue external to Spark. You must access all nodes of your cluster, and ensure `nltk` is installed on all of them. That was all I was saying. If you read that second link above, they `import sys`. You just `import nltk` like a regular Python script. If you get "module not found", then you are missing the package, just as the error says – OneCricketeer Nov 28 '16 at 16:23
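To make the comments above concrete, here is a minimal sketch of the Python side, assuming nltk is already installed on every worker node (the script name collocations.py and the top-3 cutoff are illustrative):

```python
#!/usr/bin/env python
# collocations.py -- reads one document per line from stdin and prints
# its top bigram collocations ranked by PMI. Requires nltk to be
# installed on every node that runs it.
import sys

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

bigram_measures = BigramAssocMeasures()

for line in sys.stdin:
    tokens = line.split()
    finder = BigramCollocationFinder.from_words(tokens)
    # nbest() returns the n highest-scoring bigrams under the given measure.
    for w1, w2 in finder.nbest(bigram_measures.pmi, 3):
        print("%s %s" % (w1, w2))
```

On the Scala side this pairs with the pipe() call sketched in the question, e.g. `rdd.pipe("python collocations.py")` after shipping the script to the executors (for instance via `spark-submit --files` or `SparkContext.addFile`). The key point from the comments stands: pipe() only moves text between processes, so nltk itself must be installed on each node.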
