I built a data and NLP app with a Flask front end that works perfectly locally, but a flood of issues began when I tried to set it up on AWS (Ubuntu Linux) behind an Apache server, as we are advised to do (Flask's built-in server not being designed for production deployment). Importing Python modules suddenly becomes quite a challenge with this setup, and there are a number of questions about it on Stack Overflow. I had worked through about 5 of these issues, using lots of Python logging statements throughout the code to see at exactly what point the various scripts were crashing or hanging (without error messages). Then it got further than ever before, and this issue about the location of the NLTK corpus came up. Not the NLTK module, which imported fine, just the corpus folder.
Downloading a corpus usually just takes the following code:
import nltk
nltk.download()
This opens a kind of user interface where you can select which corpus or NLP item to download and whether to store it in a different directory. By default it creates the directory nltk_data/ in your home directory and puts everything in there.
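On a server without a display it can be easier to skip the interactive downloader and name the package and target directory directly. A minimal sketch: fetch_nltk_package is a hypothetical helper name, and the package "punkt" and the directory /var/www/nltk_data in the usage note are placeholders, since the post does not say which corpus the app needs.

```python
def fetch_nltk_package(package_id, target_dir):
    """Download one NLTK package into target_dir without the interactive UI."""
    # Imported inside the function so the module itself loads even where
    # NLTK is not installed; that is a choice of this sketch, not a requirement.
    import nltk

    # download_dir overrides the default nltk_data/ location.
    return nltk.download(package_id, download_dir=target_dir)
```

It would be called as fetch_nltk_package("punkt", "/var/www/nltk_data"), run as a user that is allowed to write to that directory.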
At first I thought the issue was that the folder needed permissions for the Apache user www-data, but that didn't work. Then I noticed in the Apache error log that NLTK had looked in 4 folders and not found anything, one of them being /var/www/nltk_data and none of them being the home directory where the data actually was. I can't remember the other 3.
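The folders NLTK searches are held in the list nltk.data.path, so they can be written to the log instead of being pieced together from the error message. A rough sketch, assuming it runs inside the app under Apache; describe_search_path and log_nltk_search_path are hypothetical helper names.

```python
import os
import sys


def describe_search_path(directories):
    """Return (directory, exists, readable) for each candidate directory."""
    report = []
    for directory in directories:
        exists = os.path.isdir(directory)
        # A directory needs read and execute permission to be listed and entered.
        readable = exists and os.access(directory, os.R_OK | os.X_OK)
        report.append((directory, exists, readable))
    return report


def log_nltk_search_path():
    """Write NLTK's data search path, as seen by the current user, to stderr."""
    import nltk  # imported here so the helper above works without NLTK

    for directory, exists, readable in describe_search_path(nltk.data.path):
        # Assumption: stderr ends up in the Apache error log in this setup.
        print(f"nltk_data candidate {directory}: "
              f"exists={exists} readable={readable}", file=sys.stderr)
```

Because the permission check runs as whichever user executes the app, calling log_nltk_search_path() from the Apache-hosted process shows what www-data can see rather than what the login user can see.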
I looked through a couple of similar questions on Stack Overflow (1,2,3,4) but decided to go with something simpler, and did the following:
sudo mkdir /var/www/nltk_data
sudo cp -r nltk_data/ /var/www/
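An alternative to copying the data is to tell NLTK where it already lives, by adding that directory to nltk.data.path before any corpus is loaded. This is a sketch, not necessarily what the linked questions recommend; use_nltk_data_dir is a hypothetical helper name and the directory passed in is whatever location holds the data.

```python
def use_nltk_data_dir(directory):
    """Add directory to NLTK's data search path if it is not already listed."""
    import nltk  # imported here so the module loads even without NLTK

    if directory not in nltk.data.path:
        nltk.data.path.append(directory)
    return nltk.data.path
```

For example, use_nltk_data_dir("/var/www/nltk_data") early in the app's start-up code. The directory still has to be readable by www-data. Setting the NLTK_DATA environment variable before NLTK is imported is another way to extend the search path.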
Then I flushed the Apache log again, restarted the server and started checking the log. It was running at the usual rate, taking a few minutes to get through the scripts. I kept rechecking the log and new logging messages kept appearing. Then some message about memory started repeating on the SSH screen and the logs were no longer visible. I couldn't type anything. It logged me out and wouldn't let me log back in. I went into the AWS console and rebooted the instance twice, then stopped it and started it again, but still couldn't log in. So, in anger, I terminated it. I regretted doing that, but didn't feel there was any point keeping it if I couldn't log into it.
Questions:
- Was it okay to copy the nltk_data directory to /var/www, where Apache was looking? Should I do that again?
- If an EC2 instance runs out of memory, does it typically kill it so badly that even after stopping and starting it, you can no longer log in?
- If this happens again, from my local terminal, is there a way to reboot the thing in 'safe' mode, so I can get in there and try to undo whatever I did that killed it? (rather than just effectively deleting the whole thing)
I will likely have another go now and recreate the instance. Perhaps the 4MB of RAM was not enough for my app and Apache together. I would like to know the answers to these questions, or any other relevant tips, for when I next have to do this NLTK step.