
So I built a data and NLP app with a Flask front end which works perfectly locally, but a flood of issues began when I tried to set it up on AWS (Ubuntu Linux) behind an Apache server, as we are advised to do (Flask's built-in server not being designed for deployment). Importing Python modules suddenly becomes quite a challenge with this setup, and there are a number of questions about it on Stack Overflow. I had worked through about 5 of these issues, using lots of Python logging statements throughout the code to see exactly where the various scripts were crashing or hanging (without error messages). Then it got further than ever before, and this issue about the location of the NLTK corpus came up. Not the NLTK module, which imported fine, just the corpus folder.

Downloading the corpus usually just takes the following code:

import nltk
nltk.download()

This opens a kind of user interface where you can select which corpus or NLP item to download and whether to store it in a different directory. By default it creates the directory nltk_data/ in your home directory and puts the data there.
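
On a headless server the interactive chooser is awkward, so here is a minimal sketch of a non-interactive download. It assumes NLTK is installed and that `stopwords` is the resource needed (the one named in the error further down this thread). The helper name `fetch_corpus` and the target directory are illustrative only, not part of NLTK.

```python
import os


def fetch_corpus(name, target_dir):
    """Download one NLTK resource into target_dir without opening the chooser."""
    # Imported lazily so this module still loads where NLTK is not installed.
    import nltk

    # Create the target directory if it is missing.
    os.makedirs(target_dir, exist_ok=True)
    # nltk.download returns True on success and False on failure.
    return nltk.download(name, download_dir=target_dir, quiet=True)
```

Something like `fetch_corpus("stopwords", "/usr/share/nltk_data")` would need write access to that directory, so it would probably have to run with elevated privileges.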

So at first I thought the issue was that the folder needed permissions for the Apache user www-data, but changing those didn't help. Then I noticed in the Apache error log that it had looked in 4 folders and not found anything, one of them being /var/www/nltk_data and none of them being the home directory where the data actually was. I can't remember the other 3.
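
A likely explanation, stated as an assumption about how NLTK builds its search path rather than something confirmed here: one entry is `nltk_data` inside the home directory of whichever user runs the process, and on Ubuntu Apache usually runs as `www-data`, whose home is typically `/var/www`. The sketch below looks up that per-user directory from the passwd database. `home_nltk_dir` is a hypothetical helper name.

```python
import os
import pwd


def home_nltk_dir(username):
    """Return <home of username>/nltk_data, using the passwd database."""
    # pw_dir is the home directory recorded for this account.
    home = pwd.getpwnam(username).pw_dir
    return os.path.join(home, "nltk_data")
```

Comparing `home_nltk_dir("ubuntu")` with `home_nltk_dir("www-data")` on the server should show whether the two users resolve to different directories.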

I looked through a couple of similar questions on Stack Overflow (1, 2, 3, 4) but decided to go with something simpler. So I did the following:

sudo mkdir /var/www/nltk_data
sudo cp -r nltk_data/   /var/www/

Then I flushed the Apache log again, restarted the server and started checking the log. It was running at the usual rate, taking a few minutes to get through the scripts. I kept rechecking the log and new logging messages kept appearing, then some message about memory started repeating on the SSH screen and the logs were no longer visible. I couldn't type anything. It logged me out and wouldn't let me log back in. I went into the AWS console and rebooted it twice, then stopped it and started it again, and still couldn't log in. So, in anger, I terminated it. I regretted doing that, but didn't feel there was any point keeping it if I couldn't log into it.

Questions:

  1. Was it okay to copy the nltk_data directory to /var/www, where Apache was looking? Should I do that again?
  2. If an EC2 instance runs out of memory, does it typically kill it so badly that even after stopping and starting it, you can no longer log in?
  3. If this happens again, from my local terminal, is there a way to reboot the thing in 'safe' mode, so I can get in there and try to undo whatever I did that killed it? (rather than just effectively deleting the whole thing)

I will likely have another go now and recreate the instance. Perhaps the 4MB RAM was not enough for my app and Apache together. I would like to know the answers to these things, or any other relevant tips, for when I next have to do this NLTK step.

cardamom
  • I can't speak to AWS, but there's no reason for `nltk_data` to be under `/var/www`. Just ensure your `nltk` can find it, by setting the environment variable `NLTK_DATA` (or adjust the `nltk.data.path` in your application, see https://stackoverflow.com/questions/3522372/how-to-config-nltk-data-directory-from-code) – alexis Jul 07 '17 at 21:03
  • If `import nltk` works properly (and you have all the dependencies as well), of the four questions you linked to only `1` is relevant, and it has no answer. Question 3 is irrelevant under any circumstances. – alexis Jul 07 '17 at 21:12
  • Thanks for the answers and for looking at the posted questions. In the meantime I found a [link](https://serverfault.com/questions/170858/how-to-boot-ec2-instance-into-safe-mode) which answers one of my questions: no, unfortunately an EC2 instance cannot be booted in safe mode, and the recovery process sounds unpleasant. I have got one with more RAM this time to lessen the chances of it happening again, and will see when I am up to that point how and if I solve the nltk directory issue. – cardamom Jul 08 '17 at 13:47
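
The `NLTK_DATA` suggestion from alexis above can be sketched roughly as follows. This is only an illustration, assuming the data lives in `/home/ubuntu/nltk_data` and that the code runs at the top of the WSGI entry script before NLTK is first imported. `configure_nltk_data` is a hypothetical helper name.

```python
import os

# Assumption: the corpora sit in the login user's home directory.
NLTK_DATA_DIR = "/home/ubuntu/nltk_data"


def configure_nltk_data(data_dir=NLTK_DATA_DIR):
    """Put data_dir at the front of NLTK_DATA; call before the first `import nltk`."""
    existing = os.environ.get("NLTK_DATA", "")
    # NLTK_DATA may hold several directories joined by os.pathsep.
    parts = [p for p in existing.split(os.pathsep) if p]
    if data_dir not in parts:
        parts.insert(0, data_dir)
    os.environ["NLTK_DATA"] = os.pathsep.join(parts)
    return parts
```

The other route mentioned in that comment is to append the directory to the `nltk.data.path` list after importing NLTK. Either way, the Apache user would presumably still need read access to the directory and to every parent directory on the way to it.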

1 Answer


Thankfully I solved it after setting up a new instance, and here is how:

sudo ln -sT /home/ubuntu/nltk_data  /usr/share/nltk_data

I am happy everything is now working. I have been watching the output of the top system monitor on the new instance, which is slightly larger than the one which self-destructed, and notice that it never uses more than about 40% of the memory, but the CPU maxes out for long periods while the main scripts run. Maybe that is what killed the smaller instance.

When I initially went to install the stopwords, it looked here:

LookupError:

**********************************************************************
  Resource 'corpora/stopwords' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/home/ubuntu/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

The Apache log, however, showed that it looked in mostly the same places, except for the first one:

] [pid 19:tid 13] [client 77..]   Resource 'corpora/stopwords' not found.  Please use the NLTK
[Sat Jul 08 16:35:19.694759 2017] [wsgi:error] [pid 19437:tid 1..] [client 77..]   Downloader to obtain the resource:  >>> nltk.download()
[Sat Jul 08 16:35:19.694762 2017] [wsgi:error] [pid 19437:tid 1..] [client 77..]   Searched in:
[Sat Jul 08 16:35:19.694764 2017] [wsgi:error] [pid 19437:tid 1..] [client 77..]     - '/var/www/nltk_data'
[Sat Jul 08 16:35:19.694766 2017] [wsgi:error] [pid 19437:tid 1..] [client 77..]     - '/usr/share/nltk_data'
[Sat Jul 08 16:35:19.694768 2017] [wsgi:error] [pid 19437:tid 1..] [client 77..]     - '/usr/local/share/nltk_data'
[Sat Jul 08 16:35:19.694770 2017] [wsgi:error] [pid 19437:tid 1..] [client 77..]     - '/usr/lib/nltk_data'
[Sat Jul 08 16:35:19.694772 2017] [wsgi:error] [pid 19437:tid 1..] [client 77..]     - '/usr/local/lib/nltk_data'

So I thought it would be safer to link rather than copy, and also to use that /usr/share/ directory where Apache was looking anyway, rather than mess with its own directory.
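
To confirm a fix like this, a small check along the following lines can help. It is a sketch only: the directory list is copied from the Apache log quoted above, `find_resource` is a hypothetical helper, and it looks for a plain directory or file, whereas NLTK itself can also read zipped resources.

```python
import os

# Directories copied from the Apache log quoted above.
SEARCHED = [
    "/var/www/nltk_data",
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]


def find_resource(relative_path, search_dirs=SEARCHED):
    """Return the search dirs where relative_path exists and is readable."""
    hits = []
    for base in search_dirs:
        candidate = os.path.join(base, relative_path)
        # os.path.exists follows symlinks, so a linked nltk_data also counts.
        if os.path.exists(candidate) and os.access(candidate, os.R_OK):
            hits.append(base)
    return hits
```

The result of `find_resource("corpora/stopwords")` is only meaningful when the script runs as the same user as Apache (www-data here), since readability is checked for the current user.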

cardamom
  • Apache serves from `/usr/share`? You lost me there. But glad to hear you got it fixed. (The message you quote is from the `nltk.data` library, not from Apache). – alexis Jul 08 '17 at 19:29
  • Did you see https://stackoverflow.com/questions/3522372/how-to-config-nltk-data-directory-from-code ? – alvas Jul 10 '17 at 22:01
  • Thanks, I had a look at that. As you can see in the Apache log, Apache was looking in 5 places, excluding the 1 place where nltk_data actually was. I don't think Apache wanted to look in my home directory, so it was easier to just place something where it was already looking. – cardamom Jul 11 '17 at 11:00