-2

I have installed the NLTK library on two computers. On one of them it works fairly well (it processes about 1000 sentences in about 1 minute), while on the other it takes 1 minute to process just 10 sentences.

Note that the second (slow) computer actually has the faster hardware, so the slowdown has nothing to do with the machine itself.

This is the way I have installed it:

pip install nltk

then I run python.

In the Python interpreter:

import nltk
nltk.download()

The downloader says that some of the all-corpora packages are out of date (I don't know why), but it seems to be only this one: the PanLex Lite Corpus, which I think has nothing to do with my problem. One other package is shown as not installed: the Cross-Framework and Cross-Domain Parser Evaluation Shared Task data. I don't know if that could be related.

Those are the modules I am using:

from nltk import pos_tag
from nltk import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

And they run terribly slowly...
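Several commenters below ask for concrete timings. A minimal way to measure seconds per sentence is the standard library's `time.perf_counter`; the `tag` function here is only a placeholder (an assumption for illustration), and a real measurement would substitute the actual `pos_tag(word_tokenize(sentence))` call:

```python
import time

def seconds_per_call(fn, inputs):
    """Average wall-clock seconds per call of fn over all inputs."""
    start = time.perf_counter()
    for item in inputs:
        fn(item)
    return (time.perf_counter() - start) / len(inputs)

def tag(sentence):
    # Placeholder: in a real measurement this would be
    # pos_tag(word_tokenize(sentence)) from nltk.
    return [(word, "NN") for word in sentence.split()]

sentences = ["This is a short test sentence ."] * 1000
avg = seconds_per_call(tag, sentences)
print("avg seconds per sentence: %.6f" % avg)
```

Running this with the real tagger on both machines would turn "terribly slow" into a comparable number.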

Does anyone know why and know how to solve it?

ham-sandwich
aDoN
  • Erm, the answer is: use another library? Contribute a faster implementation to the open-source project? Or simply parallelize the algorithms? I find that using `graphlab` and then `SFrame.apply()` makes it easy to parallelize the application of functions automatically, since they don't assume a sequential list structure but a dataframe of Series where each row/Series is independent of the others. – alvas Jan 10 '16 at 15:31
  • BTW, I still don't get the question. `pos_tag` and `word_tokenize` have been optimized over the years to be fast enough. Could you explain the problem with some example of input, output and timing? – alvas Jan 10 '16 at 15:32
  • hey! But the thing is that I have been working with the nltk library before without problems. – aDoN Jan 10 '16 at 15:32
  • Also see http://stackoverflow.com/a/34609945/610569 – alvas Jan 10 '16 at 15:33
  • Do you mean the import is slow or something else is slow in the code? – alvas Jan 10 '16 at 15:33
  • **Could you explain the problem with some example of input, output and timing?** – alvas Jan 10 '16 at 15:33
  • I think it's something to do with your environment and OS. Can you specific your OS distribution and machine specs? See http://pastebin.com/0Xqz5jK5 and https://github.com/alvations/stubboRNNess/blob/master/pywsdlemmatizer.py – alvas Jan 10 '16 at 15:38
  • I get good results: http://pastebin.com/TBHvwKVz. I am using Python 2.4 on Kali 2.0. The problem is `pos_tag`: it takes about 2 seconds per sentence. Thank you – aDoN Jan 10 '16 at 16:01
  • You're using an old version. Please use `pip` to update your `nltk` to version 3.1, then take a look at http://stackoverflow.com/a/34609945/610569. And why are you using Python 2.4? The minimum Python version `nltk` requires is 2.7. Please update your Python too. – alvas Jan 10 '16 at 16:20
  • Sorry, I am using Python 2.7; I made a mistake. I did `pip install nltk --upgrade`, if that is what you mean. It still works incredibly slowly. Thanks – aDoN Jan 11 '16 at 17:32
  • What is working incredibly slowly? >_< Is it the upgrading? The data downloading? The importing? Or running the functions? Please show some timing and say explicitly what the "it" that is slow refers to... See http://stackoverflow.com/questions/8220801/how-to-use-timeit-module – alvas Jan 11 '16 at 19:04
  • I have just lemmatized 100,000 sentences with at least 10 words per sentence, and it took me 20 mins using https://github.com/alvations/stubboRNNess/blob/master/pywsdlemmatizer.py. But I used some parallelization tricks; without parallelization, I think it might take 40 mins. Even so, that is 1,000,000 words in 40 mins, i.e. 0.0024 secs per word. – alvas Jan 11 '16 at 19:07
  • Hey, thank you for your responses. It is `pos_tag` that is terribly slow; I think I already mentioned that. It takes 1 hour for 100 sentences, whereas on my other computer the SAME exact code takes 10 seconds for the same task. I don't want to use another function, because on my other (worse) computer it works all right; I have been using it and I like how it works. – aDoN Jan 11 '16 at 20:33
  • I exported the virtual machine in which it works fine and imported it on my personal computer, and it works perfectly. The only difference I see is that one has nltk 3.0.0 and the other nltk 3.1; the one with version 3.0.0 is the one that works fine. I don't know if that has something to do with it. – aDoN Jan 19 '16 at 09:04
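The parallelization trick alvas describes can be sketched with the standard library alone. This sketch uses a thread pool (`multiprocessing.dummy`) so it runs anywhere; for CPU-bound NLTK tagging you would swap the import for the process-based `multiprocessing.Pool` (same API), and `tag` here is a stand-in for the real `pos_tag` call, not NLTK itself:

```python
from multiprocessing.dummy import Pool  # thread pool; same API as multiprocessing.Pool

def tag(sentence):
    # Stand-in for nltk.pos_tag(nltk.word_tokenize(sentence)).
    # Each sentence is independent, so the work parallelizes trivially.
    return [(word, "NN") for word in sentence.split()]

def tag_all(sentences, workers=4):
    """Apply tag() to every sentence in parallel, preserving input order."""
    with Pool(workers) as pool:
        return pool.map(tag, sentences)

print(tag_all(["NLTK runs fast here", "NLTK runs slowly there"]))
```

With a process pool, each worker pays the model-load cost once rather than per sentence, which is the point of the `SFrame.apply()` comparison above.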

1 Answer

4

The WordNetLemmatizer may be the culprit. WordNet needs to read from several files to work, and all that OS-level file access may hinder performance. Consider using another lemmatizer, check whether the hard drive of the slow computer is faulty, or try defragmenting it (if on Windows).
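One pattern worth checking on the slow machine (suggested by the answer linked in the comments above) is whether the model is loaded once and reused, or reloaded on every call. The sketch below simulates both with a fake loader; with NLTK 3.1 the real equivalent would be instantiating `nltk.tag.perceptron.PerceptronTagger()` once and calling its `.tag()` method instead of calling `pos_tag` per sentence. This is an assumption about the setup, not a confirmed diagnosis:

```python
class FakeModel:
    """Stand-in for an expensive-to-load model, e.g. a pickled tagger
    or the WordNet files behind WordNetLemmatizer."""
    load_count = 0

    def __init__(self):
        FakeModel.load_count += 1  # simulate reading model files from disk

    def tag(self, tokens):
        return [(token, "NN") for token in tokens]

def tag_slow(tokens):
    # Anti-pattern: pays the model-load cost on every single call.
    return FakeModel().tag(tokens)

MODEL = FakeModel()  # load once at startup...

def tag_fast(tokens):
    # ...and reuse it for every sentence afterwards.
    return MODEL.tag(tokens)

for sentence in [["a", "b"], ["c", "d"], ["e", "f"]]:
    tag_fast(sentence)
print(FakeModel.load_count)  # → 1: only the single startup load happened
```

Comparing `pos_tag` under nltk 3.0.0 and 3.1, as the comments suggest, would confirm whether per-call loading is the actual difference between the two machines.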

Josep Valls
  • Hello, I appreciate your answer, but as I said, I got this already working high speed on another computer. The problem is not the lemmatizer or the pos tag, the problem is something else. – aDoN Jan 17 '16 at 20:23
  • This is just a preliminary diagnostic, but the reason one computer is slower than the other may be hard-drive access. Since WordNet is mostly hard-drive intensive, you may want to check why it is slower on one computer. Try to get hard-drive benchmarks. From Google: https://www.raymond.cc/blog/measure-actual-hard-disk-perfomance-under-windows/ – Josep Valls Jan 18 '16 at 00:12
  • Right now I'm on the computer that works fine (the one at work). This computer is much worse than my home computer, yet here `pos_tag` is able to tag around 1000 sentences in less than one minute. I did `pip list` and I have nltk 3.0.0. – aDoN Jan 18 '16 at 09:41
  • What I'm going to try is exporting the virtual machine I'm working on, which is fast, and importing it on my home computer. Maybe I have some extra package that makes it work much quicker, or something we are not aware of, because as I said this computer is much, much slower and yet works about 100 times better. – aDoN Jan 18 '16 at 09:43
  • I exported the virtual machine in which it works fine and imported it on my personal computer, and it works perfectly. The only difference I see is that one has nltk 3.0.0 and the other nltk 3.1; the one with version 3.0.0 is the one that works fine. I don't know if that has something to do with it. – aDoN Jan 19 '16 at 09:04
  • May very well be. 3.1 was a large update: https://github.com/nltk/nltk/blob/develop/ChangeLog You can use virtualenv to install 3.1 alongside 3.0 and compare their performance on your machine. – Josep Valls Jan 19 '16 at 16:51