Note-taking program with NLTK and WordNet doesn't work; error message says it's because of WordNet

Question

I am trying to make a program in python that will take notes on a passage that I input. It will sort out the first and last sentence of the paragraph and the sentences with dates and numbers. It would then replace some words with synonyms, and get rid of useless adjectives. I know the generic stuff with python, but I am new to nltk and WordNet. I've started a prototype program that will replace words in a sentence with all the random synonyms, however I keep getting an error that says there is something wrong with WordNet. I think I installed it right, but I might be wrong. Here is my code:

import random
import sys
from nltk.corpus import wordnet

print('Enter your passage')
Passage = sys.stdin.readline()
PassageList = Passage.split(' ')
wordCounter = 0
syns = []

def maxInt(list):
    i = 0
    for x in list:
    i += 1
return i



for x in PassageList:
    syns = wordnet.synsets(PassageList[wordCounter])
    synLength = maxInt(syns)
    PassageList[wordCounter] == syns[0]
    print(PassageList[wordCounter])
    wordCounter += 1

Here is the error I keep getting:

Traceback (most recent call last):
  File "C:\Users\shoob\Documents\Programs\Python\Programs\NoteTake.py", line 22, in <module>
    PassageList[wordCounter] == syns[0]
  File "C:\Users\shoob\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nltk\corpus\reader\wordnet.py", line 198, in __eq__
    return self._name == other._name
AttributeError: 'str' object has no attribute '_name'

If you can help in any way, it would help me out a lot. :-D

There is a lot wrong with that Python code, so it is hard to tell where the error might lie. What you want to do is quite ambitious, so I'd recommend spending more time learning Python first. It will be a good investment of time. But if you want to persevere troubleshooting this specific error, add some print lines to show exactly what is in `syns` each time. You might also get a better answer if you make your code fully reproducible (https://stackoverflow.com/help/mcve): so hard-code some test data in `PassageList`, rather than getting it from stdin. — Darren Cook, Apr 25 '18 at 09:09

alvas · Answer 1 · 2018-04-25T10:14:20.067

In Longer

The other answer is more about the NLP side of things; here is a walkthrough of the code in your original post (OP) to see what's happening.

Python Conventions

Firstly, some Python conventions. CamelCase names are conventionally reserved for classes (see PEP 8), so avoid them for ordinary variables such as Passage.

Better variable names also help: instead of PassageList, you could call it words.

E.g.

import random
import sys
from nltk.corpus import wordnet

print('Enter your passage')
passage = sys.stdin.readline()

# passage.split() is a crude form of word tokenization.
# Note that you've skipped sentence tokenization,
# so this doesn't fit the goal of getting the first and last sentence
# that you described in the OP.
# split() with no argument also drops the trailing newline from readline().
words = passage.split()

Collections is your friend

Next, the standard library ships a Counter class in collections that you can make use of; it helps with some optimization and makes the code more readable. E.g.

from collections import Counter
word_counter = Counter()

Take a look at https://docs.python.org/3/library/collections.html
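
As a minimal sketch of what Counter gives you, here is a word-frequency count over a hard-coded passage (the sample text is an assumption for illustration, not part of the OP's code):

```python
from collections import Counter

# Hypothetical sample passage, hard-coded so the example is reproducible.
passage = "the rice wine is a fortified rice wine"
words = passage.split()

# Counter maps each distinct word to the number of times it occurs.
word_counter = Counter(words)

print(word_counter["rice"])         # count for a single word
print(word_counter.most_common(2))  # the two most frequent words
```

Note that this counts occurrences; it is not a replacement for a loop index like wordCounter in the OP.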

Synsets are not Synonyms

As explained in the other answer, WordNet is indexed by meanings (aka synsets), and a synset is not a synonym. To get the synonyms, you can use the Synset.lemma_names() method. But they are quite limited, and for any ambiguous word you would have to go through word sense disambiguation (WSD) first to know which synset's lemma_names to choose from.

Also, explicit is better than implicit: human-readable variable names help a lot in understanding and improving the code, so instead of syns = [], use synonyms = [].

Otherwise, it's really unclear what syns is storing.

Functions shouldn't be abused

Disregarding the wrong indentation, it's unclear what the function is trying to achieve here. It simply adds 1 to a counter for each item in the list, which is exactly what the built-in len() does, so you could simply use len(x). (Naming the parameter list also shadows the built-in type.)

def maxInt(list):
    i = 0
    for x in list:
        i += 1
    return i

x = [1,2,3,4,5]
maxInt(x) == len(x)

To access an item from a list sequentially, simply loop

Moving on, we see that you're looping through each word in the list of words of the passage in a strange way.

Simplifying your OP,

Passage = sys.stdin.readline()
PassageList = Passage.split(' ')
wordCounter = 0

for x in PassageList:
    syns = wordnet.synsets(PassageList[wordCounter])

You could have easily done:

import sys

from nltk.corpus import wordnet as wn

passage = sys.stdin.readline()
words = passage.split()  # split() also strips the trailing newline
for word in words:
    synsets_per_word = wn.synsets(word)
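
If the position of each word is needed as well (for example, to overwrite an item in the list), enumerate supplies it without a hand-maintained counter. A minimal sketch, with a hard-coded word list and a hypothetical replacements dictionary standing in for a real synonym lookup:

```python
words = ["a", "traditional", "rice", "wine"]

# Hypothetical lookup table; in the real program this would come from
# whatever synonym source you settle on.
replacements = {"traditional": "conventional"}

for index, word in enumerate(words):
    # Assignment uses a single "=", whereas "==" only compares.
    # Words without an entry in the table are kept unchanged.
    words[index] = replacements.get(word, word)

print(words)
```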

Simply use len()

To check the number of synsets for a given word, instead of

synLength = maxInt(syns)

you could just do:

import sys

from nltk.corpus import wordnet as wn

passage = sys.stdin.readline()
words = passage.split()  # split() also strips the trailing newline
for word in words:
    synsets_per_word = wn.synsets(word)
    num_synsets_per_word = len(synsets_per_word)

Now to the troubling line

The line:

PassageList[wordCounter] == syns[0]

Given the proper variable naming convention, we have:

word == synsets_per_word[0]

Now that's the confusing part. The left-hand side is word, which is of type str, and you are comparing it to synsets_per_word[0], which is of type nltk.corpus.reader.wordnet.Synset.

Donc Voila

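Thus, when the two objects of different types are compared, Synset.__eq__ (the last frame in the traceback) evaluates self._name == other._name; a str has no _name attribute, so the AttributeError pops up.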

The bigger question is: what are you trying to achieve here? My assumption is that you think the synset is a str object. But as explained above, it's a Synset object, not a string; and even if you get the lemma_names from the Synset, that is a list of strings, not a single str that can be compared for equality with a str.
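
The mechanism can be reproduced without NLTK. The sketch below uses a stand-in class (FakeSynset is hypothetical, modelled only on the `__eq__` line shown in the traceback) to show why comparing a str with such an object fails:

```python
class FakeSynset:
    """Hypothetical stand-in that mimics the __eq__ seen in the traceback."""

    def __init__(self, name):
        self._name = name

    def __eq__(self, other):
        # Assumes `other` also has a _name attribute.
        return self._name == other._name


word = "wine"
synset = FakeSynset("wine.n.01")

message = None
try:
    # str.__eq__ returns NotImplemented for a non-str operand, so Python
    # falls back to FakeSynset.__eq__(synset, word), which touches word._name.
    word == synset
except AttributeError as err:
    message = str(err)

print(message)
```

Two objects built from the same stand-in class compare without error, because both sides then have a _name attribute.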

So how do you fix the problem?

First read up on NLP, Python and what the WordNet API can do in NLTK.

Then redefine the task since you're not going to get a lot of help from WordNet with ambiguous words.

That is a very generous (two) answers Alvas. — Darren Cook, Apr 27 '18 at 09:24
Hope it helped the questioner =) — alvas, Apr 27 '18 at 14:48

alvas · Answer 2 · 2018-04-25T09:47:56.343

In Short

Your project is a little overly ambitious.

Also, try to ask more specific questions on Stack Overflow. Focus on finding out what is wrong and explain what help you need. That'll help people help you better.

In Long

Let's try and break down your requirements:

I am trying to make a program in python that will take notes on a passage that I input.

Not sure what that really means...

It will sort out the first and last sentence of the paragraph ...

The original code in the original post (OP) doesn't have any checks on the dates/numbers.

First, you need to define what a sentence is:

What counts as a sentence boundary?
How are you going to detect sentences in a paragraph?

Perhaps, nltk.sent_tokenize would help:

from nltk import sent_tokenize

text = """Gwaha-ju (과하주; 過夏酒; literally "summer-passing wine") is a traditional Korean fortified rice wine. The refined rice wine cheongju (also called yakju) is fortified by adding the distilled spirit soju to produce gwaha-ju. Gwaha-baekju was first mentioned in Sanga Yorok, a mid-15th century cookbook, but the rice wine was made without fortification. The earliest recorded recipe for fortified gangha-ju appears in Eumsik dimibang, a 1670 cookbook.  Other Joseon books that mention the fortified rice wine include Jubangmun, Chisaeng yoram, Yeokjubangmun, Eumsikbo, Sallim gyeongje, Jeungbo sallim gyeongje, Gyuhap chongseo, and Imwon gyeongjeji."""

sent_tokenize(text)
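
Once the paragraph is a list of sentences, picking the first and last is plain list indexing. A minimal sketch, with a hard-coded list standing in for the output of sent_tokenize (the helper name first_and_last and the shortened sentences are hypothetical):

```python
def first_and_last(sentences):
    """Return the first and last sentence, without duplicating a lone one."""
    if not sentences:
        return []
    if len(sentences) == 1:
        return [sentences[0]]
    return [sentences[0], sentences[-1]]


# Hand-written stand-in for the list a sentence tokenizer would return.
sentences = [
    "Gwaha-ju is a traditional Korean fortified rice wine.",
    "The refined rice wine cheongju is fortified by adding soju.",
    "The earliest recorded recipe appears in a 1670 cookbook.",
]

print(first_and_last(sentences))
```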

... and the sentences with dates and numbers.

Hmm, how about checking for digits in the string of each sentence:

from nltk import sent_tokenize

text = """Gwaha-ju (과하주; 過夏酒; literally "summer-passing wine") is a traditional Korean fortified rice wine. The refined rice wine cheongju (also called yakju) is fortified by adding the distilled spirit soju to produce gwaha-ju. Gwaha-baekju was first mentioned in Sanga Yorok, a mid-15th century cookbook, but the rice wine was made without fortification. The earliest recorded recipe for fortified gangha-ju appears in Eumsik dimibang, a 1670 cookbook.  Other Joseon books that mention the fortified rice wine include Jubangmun, Chisaeng yoram, Yeokjubangmun, Eumsikbo, Sallim gyeongje, Jeungbo sallim gyeongje, Gyuhap chongseo, and Imwon gyeongjeji."""

for sent in sent_tokenize(text):
    if any(ch.isdigit() for ch in sent):
        print(sent)
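
A bare digit check also matches things like "mid-15th". If that is too loose, a regular expression can narrow the match. The sketch below encodes one assumption about what counts as a "date" (a standalone four-digit number); the pattern and the helper name has_year are illustrative, not something from the OP:

```python
import re

# Assumption: a "date" is any standalone run of exactly four digits.
YEAR_PATTERN = re.compile(r"\b\d{4}\b")


def has_year(sentence):
    """Return True if the sentence contains a standalone four-digit number."""
    return YEAR_PATTERN.search(sentence) is not None


# Hand-written sample sentences, shortened from the example text.
sentences = [
    "Gwaha-baekju was first mentioned in a mid-15th century cookbook.",
    "The earliest recorded recipe appears in a 1670 cookbook.",
    "Gwaha-ju is a traditional Korean fortified rice wine.",
]

with_digits = [s for s in sentences if any(ch.isdigit() for ch in s)]
with_years = [s for s in sentences if has_year(s)]

print(len(with_digits), len(with_years))
```

Whether four digits is the right rule depends on the passages you expect; it would also match quantities such as "2000 bottles".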

It would then replace some words ...

Then you have to define what a word is:

How do you define a word boundary?
It won't be the same for different languages.

Maybe with nltk.word_tokenize, e.g.

from nltk import sent_tokenize, word_tokenize

text = """Gwaha-ju (과하주; 過夏酒; literally "summer-passing wine") is a traditional Korean fortified rice wine. The refined rice wine cheongju (also called yakju) is fortified by adding the distilled spirit soju to produce gwaha-ju. Gwaha-baekju was first mentioned in Sanga Yorok, a mid-15th century cookbook, but the rice wine was made without fortification. The earliest recorded recipe for fortified gangha-ju appears in Eumsik dimibang, a 1670 cookbook.  Other Joseon books that mention the fortified rice wine include Jubangmun, Chisaeng yoram, Yeokjubangmun, Eumsikbo, Sallim gyeongje, Jeungbo sallim gyeongje, Gyuhap chongseo, and Imwon gyeongjeji."""

for sent in sent_tokenize(text):
    if any(ch.isdigit() for ch in sent):
        print(word_tokenize(sent))

It would then replace some words with synonyms,

Not sure which words you would like to replace with synonyms, or which synonyms you're going to choose from. But do note that WordNet is not exactly a good thesaurus.

Each word comes with different meanings, and in WordNet only meanings are linked, not words; see https://stackoverflow.com/a/19383914/610569

E.g. given the word "wine":

from nltk.corpus import wordnet as wn

for synset in wn.synsets('wine'): # each meaning for the word, aka. synset
    print(synset)
    print('Words with same meaning:', synset.lemma_names(), '\n')

How do you know which synset/meaning to use?

That is an open question. It's also known as the Word Sense Disambiguation (WSD) task.

If you just flatten and use the lemma names of all synsets, the "synonyms" or replacements you want to make won't make sense. E.g.

from itertools import chain

from nltk.corpus import wordnet as wn
from nltk import sent_tokenize, word_tokenize

text = """Gwaha-ju (과하주; 過夏酒; literally "summer-passing wine") is a traditional Korean fortified rice wine. The refined rice wine cheongju (also called yakju) is fortified by adding the distilled spirit soju to produce gwaha-ju. Gwaha-baekju was first mentioned in Sanga Yorok, a mid-15th century cookbook, but the rice wine was made without fortification. The earliest recorded recipe for fortified gangha-ju appears in Eumsik dimibang, a 1670 cookbook.  Other Joseon books that mention the fortified rice wine include Jubangmun, Chisaeng yoram, Yeokjubangmun, Eumsikbo, Sallim gyeongje, Jeungbo sallim gyeongje, Gyuhap chongseo, and Imwon gyeongjeji."""

for sent in sent_tokenize(text):
    if any(ch.isdigit() for ch in sent):
        for word in word_tokenize(sent):
            lemma_names = set(chain.from_iterable(synset.lemma_names() for synset in wn.synsets(word)))
            # If you just flatten and use the lemma names of all synsets,
            # the "synonyms" or replacements you want to make won't make sense.
            print(word, '\t', lemma_names)

... and get rid of useless adjectives.

Hmm, that'll require yet another NLP step called POS tagging, and it's not perfect.

Perhaps you can try nltk.pos_tag but don't expect too much of it (in terms of accuracy), e.g.

from itertools import chain

from nltk.corpus import wordnet as wn
from nltk import sent_tokenize, word_tokenize, pos_tag

text = """Gwaha-ju (과하주; 過夏酒; literally "summer-passing wine") is a traditional Korean fortified rice wine. The refined rice wine cheongju (also called yakju) is fortified by adding the distilled spirit soju to produce gwaha-ju. Gwaha-baekju was first mentioned in Sanga Yorok, a mid-15th century cookbook, but the rice wine was made without fortification. The earliest recorded recipe for fortified gangha-ju appears in Eumsik dimibang, a 1670 cookbook.  Other Joseon books that mention the fortified rice wine include Jubangmun, Chisaeng yoram, Yeokjubangmun, Eumsikbo, Sallim gyeongje, Jeungbo sallim gyeongje, Gyuhap chongseo, and Imwon gyeongjeji."""

for sent in sent_tokenize(text):
    if any(ch.isdigit() for ch in sent):

        for word, tag in pos_tag(word_tokenize(sent)):
            if not tag.startswith('JJ'): # JJ* refers to adjective.
                print(word)
        print('-----')
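
To keep the remaining words together instead of printing them one per line, the same filter can be written as a list comprehension. In this sketch the (word, tag) pairs are written by hand in the shape pos_tag returns, so the tags shown are assumptions, not real tagger output:

```python
# Hand-written (word, tag) pairs using Penn Treebank-style tags.
tagged = [
    ("The", "DT"), ("earliest", "JJS"), ("recorded", "JJ"),
    ("recipe", "NN"), ("appears", "VBZ"), ("in", "IN"),
    ("a", "DT"), ("1670", "CD"), ("cookbook", "NN"), (".", "."),
]

# Drop every token whose tag starts with JJ (JJ, JJR, JJS).
kept = [word for word, tag in tagged if not tag.startswith("JJ")]

print(" ".join(kept))
```

Joining with spaces leaves a gap before punctuation, so a real program would need a detokenization step as well.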

I know the generic stuff with python, but I am new to nltk and WordNet. I've started a prototype program that will replace words in a sentence with all the random synonyms,

Keep it up and don't be discouraged! That said, starting with the goal of building an application may not be the right place to start with NLP; try instead:

however I keep getting an error that says there is something wrong with WordNet. I think I installed it right, but I might be wrong.

Yes, there's nothing wrong with the installation.

Perhaps going through the WordNet API in NLTK would help you to understand how and what WordNet can do: http://www.nltk.org/howto/wordnet.html

Also, improving basic Python and understanding why the AttributeError is occurring would help a lot =)