
I'm trying to use NLTK in my Flask app on Google App Engine standard. But I'm unable to find a neat way to download / load NLTK stopwords on GAE standard.

I saw this solution for Django (How to download all nltk data in google cloud app engine?) which suggests downloading data, hosting it with all the other files on GAE, and linking nltk.data.path to it. However, that seems quite hacky and I'd also like to keep my total GAE directory size low.

I have tried to replicate this situation in GAE Flexible. There I'd just add "RUN python -m nltk.downloader all -d /usr/local/nltk_data" to my Dockerfile.

Are there any good solutions for GAE Standard?

rkj
  • Why not just copy the list from the NLTK source code and paste it into your own code (with attribution, of course)? – new name Apr 02 '20 at 11:56

1 Answer


I understand you want to use NLTK stopwords in GAE standard, but I think you're confusing things a bit: one way or another, the stopwords data has to live somewhere, either as a file in a folder or fully in memory.

As you said, in GAE Flexible you could put RUN python -m nltk.downloader all -d /usr/local/nltk_data into the Dockerfile. That command downloads the NLTK data (stopwords included) and places it in your container's folder structure. In that sense, saving the file yourself (as suggested in the thread you linked) and having Docker save it for you are equivalent: both end up with the file in a folder.
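For the "file in a folder" route on GAE standard, here is a minimal sketch of linking nltk.data.path to a bundled directory. The helper name prepend_data_dir and the folder name nltk_data are my own placeholders, and I'm assuming the folder contains corpora/stopwords/ as produced by the NLTK downloader.

```python
def prepend_data_dir(search_path, data_dir):
    """Put data_dir at the front of search_path unless it is already listed."""
    if data_dir not in search_path:
        search_path.insert(0, data_dir)
    return search_path


# Intended use at app start-up (needs NLTK installed; "nltk_data" is a
# hypothetical folder deployed next to main.py):
#
#   import nltk
#   from nltk.corpus import stopwords
#   prepend_data_dir(nltk.data.path, "nltk_data")
#   english = set(stopwords.words("english"))
```

A relative path depends on the working directory, so an absolute path is probably safer in a real deployment. If size is the concern, you only need to ship the stopwords corpus, not everything the downloader fetches with all.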

The alternative suggested by gaefan in the comments also means storing the NLTK stopwords data; this time it would be inlined in the application code rather than kept in a separate file.
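A rough sketch of the inlined variant follows. The set below is only a short illustrative subset, not the full NLTK English list (you would paste the complete list, with attribution), and remove_stopwords is a hypothetical helper name.

```python
# Illustrative subset only, NOT the full NLTK English stopword list.
# Replace it with the complete list copied from NLTK (with attribution).
STOPWORDS = frozenset({"a", "an", "and", "is", "of", "on", "the", "to"})


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not stopwords, comparing case-insensitively."""
    return [token for token in tokens if token.lower() not in stopwords]


filtered = remove_stopwords("The cat is on the mat".split())
```

With this approach there is no data file to deploy and no NLTK download step at all; the trade-off is that you have to update the list by hand if you ever want a newer version.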

All in all, none of the approaches mentioned so far seems hacky to me, and I would recommend any of them.

That said, if you really don't want the file in your codebase, you could store it in Google Cloud Storage and retrieve it from there. You can either fetch it every time you need it, or fetch it once and keep it in memory or in the /tmp folder. However, this option comes at the cost of extra application latency, extra RAM usage, and having to check on every use whether the current instance has already downloaded it.
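The "retrieve it just once" variant could look roughly like the sketch below. The bucket name, blob name, cache file name, and both function names are assumptions of mine, and I'm assuming the stored object is a plain text file with one word per line. The download step is passed in as a callable so the caching logic can be exercised without Cloud Storage; fetch_from_gcs needs the google-cloud-storage package.

```python
import os
import tempfile

_CACHE = None  # in-memory copy; lives only as long as this instance


def fetch_from_gcs(destination, bucket_name="my-bucket",
                   blob_name="nltk/stopwords/english"):
    """Download the stopwords file (bucket and blob names are hypothetical)."""
    from google.cloud import storage  # lazy import; needs google-cloud-storage

    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.download_to_filename(destination)


def get_stopwords(fetch=fetch_from_gcs, cache_path=None):
    """Return the stopwords, downloading at most once per instance."""
    global _CACHE
    if _CACHE is not None:
        return _CACHE  # already loaded in memory
    if cache_path is None:
        cache_path = os.path.join(tempfile.gettempdir(), "stopwords_english.txt")
    if not os.path.exists(cache_path):
        fetch(cache_path)  # first use on this instance: download the file
    with open(cache_path, encoding="utf-8") as handle:
        _CACHE = frozenset(line.strip() for line in handle if line.strip())
    return _CACHE
```

The first request served by each new instance pays for the download, which is the latency cost mentioned above, and both the cached file and the in-memory set take up memory on the instance.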

Happy-Monad
  • I was just hesitant about using GAE Flexible, since I'd need to figure out port business in Docker for my remote DB cluster. And I barely understand how Docker works. I can deploy simple Flask API but that's it. – rkj Apr 03 '20 at 10:26
  • Yes, Docker requires some configuration. I would recommend staying with GAE standard as long as it fits your needs; it's also cheaper. – Happy-Monad Apr 03 '20 at 10:29