I've got a short function that checks whether a word is a real word by comparing it to the WordNet corpus from the Natural Language Toolkit. I'm calling this function from a thread that validates .txt files. When I run my code, the first time the function is called it throws an AttributeError with the message

"'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'"

When I pause execution, the same line of code does not throw an error, so I assume that the corpus is not yet loaded on my first call, which causes the error.

I have tried calling wn.ensure_loaded() to force-load the corpus, but I'm still getting the same error.

Here's my function:

from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.corpus.reader.wordnet import WordNetError
import sys

cachedStopWords = stopwords.words("english")

def is_good_word(word):
    word = word.strip()
    if len(word) <= 2:
        return 0
    if word in cachedStopWords:
        return 0
    try:
        wn.ensure_loaded()
        if len(wn.lemmas(str(word), lang='en')) == 0:
            return 0
    except WordNetError as e:
        print "WordNetError on concept {}".format(word)
    except AttributeError as e:
        print "Attribute error on concept {}: {}".format(word, e.message)
    except:
        print "Unexpected error on concept {}: {}".format(word, sys.exc_info()[0])
    else:
        return 1
    return 1

print (is_good_word('dog')) #Does NOT throw error

If I call the function from a print statement at the global scope of the same file, as above, it does not throw the error. However, if I call it from my thread, it does. The following is a minimal example that reproduces the error. I've tested it, and on my machine it gives the output

Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
Attribute error on concept dog: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'

Minimal example:

import time
import threading
from filter_tag import is_good_word

class ProcessMetaThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        is_good_word('dog') #Throws error


def process_meta(numberOfThreads):

    threadsList = []
    for i in range(numberOfThreads):
        t = ProcessMetaThread()
        t.setDaemon(True)
        t.start()
        threadsList.append(t)

    numComplete = 0
    while numComplete < numberOfThreads:
        # Iterate over the active processes
        for processNum in range(0, numberOfThreads):
            # If a process actually exists
            if threadsList != None:
                # If the process is finished
                if not threadsList[processNum] == None:
                    if not threadsList[processNum].is_alive():
                        numComplete += 1
                        threadsList[processNum] = None
        time.sleep(5)

    print 'Processes Finished'


if __name__ == '__main__':
    process_meta(10)

2 Answers

I ran your code and got the same error. For a working solution, see below. Here is the explanation:

LazyCorpusLoader is a proxy object that stands in for a corpus object before the corpus is loaded. (This prevents the NLTK from loading massive corpora into memory before you need them.) The first time this proxy object is accessed, however, it becomes the corpus you intend to load. That is to say, the LazyCorpusLoader proxy object transforms its __dict__ and __class__ into the __dict__ and __class__ of the corpus you are loading.
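
The idiom can be sketched as follows. This is a simplified stand-in, not nltk's actual classes; `LazyLoader` and `RealReader` are hypothetical names for `LazyCorpusLoader` and `WordNetCorpusReader`:

```python
class RealReader(object):
    """Stands in for WordNetCorpusReader (any heavy object)."""
    def lemmas(self, word):
        return [word]

class LazyLoader(object):
    """Simplified stand-in for nltk's LazyCorpusLoader proxy."""
    def __init__(self):
        self.__args = ("corpus", "config")   # stored as _LazyLoader__args

    def __load(self):
        reader = RealReader()
        # Adopt the real reader's identity: after these two lines the
        # proxy *is* a RealReader, and the mangled attributes are gone.
        self.__dict__ = reader.__dict__
        self.__class__ = reader.__class__

    def __getattr__(self, name):
        self.__load()                        # first access triggers the load
        return getattr(self, name)

proxy = LazyLoader()
print(type(proxy).__name__)    # LazyLoader
proxy.lemmas("dog")            # first access: proxy becomes a RealReader
print(type(proxy).__name__)    # RealReader
```

After the first attribute access, every reference to the proxy (in every thread, since they all share the same object) sees a `RealReader`.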

If you compare your code to your errors above, you can see that you received 9 errors when you tried to create 10 instances of your class. The first transformation of the LazyCorpusLoader proxy object into a WordNetCorpusReader object was successful. This action was triggered when you accessed wordnet for the first time:

The First Thread

from nltk.corpus import wordnet as wn
def is_good_word(word):
    ...
    wn.ensure_loaded()  # `LazyCorpusLoader` conversion into `WordNetCorpusReader` starts

The Second Thread

When you begin to run your is_good_word function in a second thread, however, your first thread has not completely transformed the LazyCorpusLoader proxy object into a WordNetCorpusReader. wn is still a LazyCorpusLoader proxy object, so it begins the __load process again. Once it gets to the point where it tries to convert its __class__ and __dict__ into a WordNetCorpusReader object, however, the first thread has already converted the LazyCorpusLoader proxy object into a WordNetCorpusReader. My guess is that you are running into an error in the line with my comment below:

class LazyCorpusLoader(object):
    ...
    def __load(self):
        ...
        corpus = self.__reader_cls(root, *self.__args, **self.__kwargs)  # load corpus
        ...
        # self.__args == self._LazyCorpusLoader__args
        args, kwargs  = self.__args, self.__kwargs                       # most likely the line throwing the error

Once the first thread has transformed the LazyCorpusLoader proxy object into a WordNetCorpusReader object, the mangled names will no longer work. The WordNetCorpusReader object will not have LazyCorpusLoader anywhere in its mangled names. (self.__args is equivalent to self._LazyCorpusLoader__args while the object is a LazyCorpusLoader object.) Thus you get the following error:

AttributeError: 'WordNetCorpusReader' object has no attribute '_LazyCorpusLoader__args'
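
The re-entry failure can be reproduced without any threads at all. In this sketch (hypothetical class names, not nltk's code), the load runs once successfully and is then re-run on the already-transformed object, exactly as a second thread would re-run it:

```python
class Reader(object):
    """Stands in for WordNetCorpusReader."""

class Loader(object):
    """Stands in for LazyCorpusLoader."""
    def __init__(self):
        self.__args = ()                 # stored as _Loader__args

    def load(self):
        args = self.__args               # raises on the second pass
        reader = Reader()
        self.__dict__ = reader.__dict__  # mangled attributes vanish here
        self.__class__ = Reader

obj = Loader()
Loader.__dict__["load"](obj)             # "thread 1": transformation succeeds
# "thread 2" re-enters load() on the same, already-transformed object:
try:
    Loader.__dict__["load"](obj)
except AttributeError as e:
    print(e)   # 'Reader' object has no attribute '_Loader__args'
```

The second pass fails on the very first line of `load`, because `self.__args` is compiled to the mangled name `self._Loader__args`, which no longer exists once `__dict__` and `__class__` have been replaced.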

An Alternative

In light of this issue, you will want to access the wn object before you start your threads. Here is your code modified appropriately:

from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.corpus.reader.wordnet import WordNetError
import sys
import time
import threading

cachedStopWords = stopwords.words("english")


def is_good_word(word):
    word = word.strip()
    if len(word) <= 2:
        return 0
    if word in cachedStopWords:
        return 0
    try:
        if len(wn.lemmas(str(word), lang='en')) == 0:     # no longer the first access of wn
            return 0
    except WordNetError as e:
        print("WordNetError on concept {}".format(word))
    except AttributeError as e:
        print("Attribute error on concept {}: {}".format(word, e.message))
    except:
        print("Unexpected error on concept {}: {}".format(word, sys.exc_info()[0]))
    else:
        return 1
    return 1


class ProcessMetaThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        is_good_word('dog')


def process_meta(numberOfThreads):
    print wn.__class__            # <class 'nltk.corpus.util.LazyCorpusLoader'>
    wn.ensure_loaded()            # first access to wn transforms it
    print wn.__class__            # <class 'nltk.corpus.reader.wordnet.WordNetCorpusReader'>
    threadsList = []
    for i in range(numberOfThreads):
        start = time.clock()
        t = ProcessMetaThread()
        print time.clock() - start
        t.setDaemon(True)
        t.start()
        threadsList.append(t)

    numComplete = 0
    while numComplete < numberOfThreads:
        # Iterate over the active processes
        for processNum in range(0, numberOfThreads):
            # If a process actually exists
            if threadsList != None:
                # If the process is finished
                if not threadsList[processNum] == None:
                    if not threadsList[processNum].is_alive():
                        numComplete += 1
                        threadsList[processNum] = None
        time.sleep(5)

    print('Processes Finished')


if __name__ == '__main__':
    process_meta(10)

I have tested the above code and received no errors.
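
If loading before the threads start is not convenient, the one-time load can also be guarded with a lock. This is a generic double-checked-locking sketch with my own names, not nltk's API; in your code the same idea would mean wrapping the first wn.ensure_loaded() call in a threading.Lock:

```python
import threading

class SafeLazyLoader(object):
    """Generic sketch: serialize the one-time load so concurrent
    first accesses cannot race on the transformation."""
    def __init__(self, factory):
        self._factory = factory          # builds the real object once
        self._obj = None
        self._lock = threading.Lock()

    def get(self):
        if self._obj is None:            # fast path once loaded
            with self._lock:
                if self._obj is None:    # re-check under the lock
                    self._obj = self._factory()
        return self._obj
```

Only the first caller pays for the load; everyone else either waits on the lock during the load or takes the lock-free fast path afterwards.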

  • Great explanation! But can you explain how `wn` can be in an inconsistent state? Its dicts are thread-global, so it contains the `WordNetCorpusReader`'s mangled attributes. But what thread-local information makes it look like a `LazyCorpusLoader` to the second thread? – alexis Dec 12 '14 at 12:31
  • @alexis great question! I'll see if I can flesh this out a bit as soon as I get the chance. (It is very early morning here.) The bottom line is that wn is still a proxy object when the second thread first accesses it despite having had its class and dict changed, but the details are an important thing to include here. – Justin O Barber Dec 12 '14 at 12:40
  • Thanks, I'll look forward to reading it. I've no idea what kind of "proxy object" would break _reproducibly_ like this. (A race condition, with thread 2 seeing the object in mid-transformation, shouldn't give such consistent errors.) – alexis Dec 12 '14 at 12:48
  • @alexis Ah, I see your point. Excellent point. I would not say this is a race condition. Rather, I suspect that the OP's machine is loading the corpus slowly, which stalls all 10 proxy objects while they load the corpus. (See my edited answer above for a fuller explanation of my theory.) In any case, my explanation of the error should still be accurate, even if my theory turns out to be incorrect. (Perhaps the OP will confirm one way or another?) Thanks for the insight. – Justin O Barber Dec 12 '14 at 14:13
  • Good answer, although for the record, I would call it a race condition - just a very slow one which is why you get such consistency. – Basic Dec 12 '14 at 19:31
  • @Basic Good point. You are right. The slowness of the process led me to use different language, but this is a race condition. – Justin O Barber Dec 12 '14 at 19:33
  • Great answer. I moved ``wn.ensure_loaded`` outside the ``is_good_word`` function definition, and I haven't been getting any AttributeErrors. Unfortunately, I'm still getting a WordNetError with a garbage message. Before I write a new question, I just wanted to confirm that you aren't experiencing any errors running the edited example on your machine. – Cecilia Dec 12 '14 at 23:39
  • Nevermind, I put it in the wrong place. The way you do it everything works like a charm. – Cecilia Dec 12 '14 at 23:49
  • @2cents Great. Glad you got it figured out! This was a great question by the way. Best of luck! – Justin O Barber Dec 13 '14 at 01:15
  • Thanks, makes sense now. But a quibble with the new version. "`wn` _in that thread_ is still a LazyCorpusLoader proxy object,": The point of the race condition is that `wn` in all threads points to _the same_ object; so this is misleading. – alexis Dec 13 '14 at 14:30
  • @alexis Good point. That is misleading. I have changed the answer accordingly. Thanks again. – Justin O Barber Dec 13 '14 at 14:43
  • @JustinBarber Hi, my class is not identifying ensure_loaded() . It says "Undefined variable from import: ensure_loaded". Please help me with this. – Alekhya Vemavarapu Oct 12 '15 at 08:40
  • Hello, @Alekhya Vemavarapu. I suspect you are using Pydev? If so, you can see [how to deal with such errors here](http://stackoverflow.com/a/2248987/1775603). If not, what version of the NLTK are you running? – Justin O Barber Oct 12 '15 at 10:49
  • Hi, 3.0.5 is the version of NLTK i'm using. Python 2.7.7 – Alekhya Vemavarapu Oct 12 '15 at 11:43

I had this issue recently while trying to use WordNet synsets, and I found another way around it. I'm running an application with FastAPI and uvicorn that needs to handle thousands and thousands of requests per second. I tried many different solutions, but in the end the best one was to move the WordNet synsets into a separate dictionary. It increased the server start-up time by about 5 seconds (and it didn't consume that much memory), but of course the performance when reading data this way is superb.

from nltk.corpus import wordnet
from itertools import chain

def get_synsets(word):
    try:
        return synsets[word.lower()]
    except KeyError:
        return []

synsets = {}
lemmas_in_wordnet = set(chain(*[x.lemma_names() for x in wordnet.all_synsets()]))

for lemma in lemmas_in_wordnet:
    synsets[lemma] = wordnet.synsets(lemma)
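
If precomputing every lemma costs too much start-up time, a middle ground (my suggestion, not part of the answer above) is to memoize lookups on first use with functools.lru_cache. Here `slow_lookup` is a stand-in for wordnet.synsets:

```python
from functools import lru_cache

def slow_lookup(word):
    # stand-in for wordnet.synsets(word); assume any pure, expensive lookup
    return [word.upper()]

@lru_cache(maxsize=None)
def cached_lookup(word):
    # pay the lookup cost once per word; repeats hit the cache at dict speed
    return tuple(slow_lookup(word.lower()))
```

The first call per word is as slow as the underlying lookup; every repeat is a dictionary hit, which is the same effect the precomputed dictionary achieves, but paid lazily.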

The performance

timeit("synsets['word']", globals=globals(), number=1000000)
0.03934049699999953

timeit("wordnet.synsets('word')", globals=globals(), number=1000000)
11.193742591000001