In Short
Your project is a little overly ambitious.
Also, try to ask more specific questions on Stack Overflow. Focus on finding out what is wrong and explain what help you need. That'll help people help you better.
In Long
Let's try and break down your requirements:
I am trying to make a program in python that will take notes on a passage that I input.
Not sure what that really means...
It will sort out the first and last sentence of the paragraph ...
The original code in the original post (OP) doesn't have any checks on the dates/numbers.
First, you need to define what a sentence is:
- What counts as a sentence boundary?
- How are you going to detect sentences in a paragraph?
Perhaps nltk.sent_tokenize would help:
from nltk import sent_tokenize
text = """Gwaha-ju (과하주; 過夏酒; literally "summer-passing wine") is a traditional Korean fortified rice wine. The refined rice wine cheongju (also called yakju) is fortified by adding the distilled spirit soju to produce gwaha-ju. Gwaha-baekju was first mentioned in Sanga Yorok, a mid-15th century cookbook, but the rice wine was made without fortification. The earliest recorded recipe for fortified gangha-ju appears in Eumsik dimibang, a 1670 cookbook. Other Joseon books that mention the fortified rice wine include Jubangmun, Chisaeng yoram, Yeokjubangmun, Eumsikbo, Sallim gyeongje, Jeungbo sallim gyeongje, Gyuhap chongseo, and Imwon gyeongjeji."""
print(sent_tokenize(text))  # a list with one string per detected sentence
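Once you have a list of sentences, picking the first and last one is the easy part. Here's a minimal sketch, using only the standard library, that also shows why the boundary question matters. Both naive_sentences and first_and_last are hypothetical helpers made up for this illustration; the regex splitter is a crude stand-in for sent_tokenize, not how NLTK does it.

```python
import re

def naive_sentences(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace.

    A crude stand-in for nltk.sent_tokenize: it wrongly splits after
    abbreviations such as "Dr.".
    """
    parts = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [p for p in parts if p]

def first_and_last(sentences):
    """Return the first and last sentence (a single item if there is only one)."""
    if not sentences:
        return []
    if len(sentences) == 1:
        return [sentences[0]]
    return [sentences[0], sentences[-1]]

paragraph = "Soju is a distilled spirit. It is clear. Dr. Kim studies it."
sentences = naive_sentences(paragraph)
print(sentences)                  # "Dr." ends up as a sentence of its own
print(first_and_last(sentences))  # so the "last sentence" is only a fragment
```

The paragraph has three sentences, but the naive splitter reports four, so the "last sentence" comes out truncated. That's the kind of case a trained sentence tokenizer is meant to handle better.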
... and the sentences with dates and numbers.
Hmm, how about checking for digits in each sentence:
from nltk import sent_tokenize
text = """Gwaha-ju (과하주; 過夏酒; literally "summer-passing wine") is a traditional Korean fortified rice wine. The refined rice wine cheongju (also called yakju) is fortified by adding the distilled spirit soju to produce gwaha-ju. Gwaha-baekju was first mentioned in Sanga Yorok, a mid-15th century cookbook, but the rice wine was made without fortification. The earliest recorded recipe for fortified gangha-ju appears in Eumsik dimibang, a 1670 cookbook. Other Joseon books that mention the fortified rice wine include Jubangmun, Chisaeng yoram, Yeokjubangmun, Eumsikbo, Sallim gyeongje, Jeungbo sallim gyeongje, Gyuhap chongseo, and Imwon gyeongjeji."""
for sent in sent_tokenize(text):
    # Keep only the sentences that contain at least one digit.
    if any(ch.isdigit() for ch in sent):
        print(sent)
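Note that a plain digit check treats "mid-15th century" and "1670" the same way. If you care about the difference between a date and any other number, you need to define that too. A rough sketch, assuming (my assumption, not something from the question) that a "year" is a standalone 4-digit number between 1000 and 2099; classify is a hypothetical helper:

```python
import re

# Assumption: a "year" is a standalone 4-digit number from 1000 to 2099.
YEAR = re.compile(r'\b(1\d{3}|20\d{2})\b')
# Any run of digits, e.g. the "15" in "15th".
NUMBER = re.compile(r'\d+')

def classify(sentence):
    """Return 'year', 'number' or None depending on what the sentence contains."""
    if YEAR.search(sentence):
        return 'year'
    if NUMBER.search(sentence):
        return 'number'
    return None

examples = [
    "The earliest recorded recipe appears in a 1670 cookbook.",
    "It was first mentioned in a mid-15th century cookbook.",
    "It is a traditional Korean fortified rice wine.",
]
for s in examples:
    print(classify(s), '\t', s)
```

This will still miss dates written in words ("the fifteenth century") and will mislabel 4-digit quantities as years, so treat it as a starting point only.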
It would then replace some words ...
Then you have to define what a word is:
- How do you define a word boundary?
- It won't be the same for different languages.
Maybe with nltk.word_tokenize, e.g.
from nltk import sent_tokenize, word_tokenize
text = """Gwaha-ju (과하주; 過夏酒; literally "summer-passing wine") is a traditional Korean fortified rice wine. The refined rice wine cheongju (also called yakju) is fortified by adding the distilled spirit soju to produce gwaha-ju. Gwaha-baekju was first mentioned in Sanga Yorok, a mid-15th century cookbook, but the rice wine was made without fortification. The earliest recorded recipe for fortified gangha-ju appears in Eumsik dimibang, a 1670 cookbook. Other Joseon books that mention the fortified rice wine include Jubangmun, Chisaeng yoram, Yeokjubangmun, Eumsikbo, Sallim gyeongje, Jeungbo sallim gyeongje, Gyuhap chongseo, and Imwon gyeongjeji."""
for sent in sent_tokenize(text):
    if any(ch.isdigit() for ch in sent):
        # Split each selected sentence into word and punctuation tokens.
        print(word_tokenize(sent))
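To see why the word boundary question isn't trivial, here's a small standard-library sketch comparing three possible definitions of a "word" on the same sentence. None of these is what word_tokenize does internally; they're only meant to show that the answer changes with the definition.

```python
import re

sentence = 'Gwaha-ju (literally "summer-passing wine") is a fortified rice wine.'

# Definition 1: a word is anything between whitespace.
# Punctuation stays glued to the words, e.g. '(literally' and 'wine.'.
by_whitespace = sentence.split()

# Definition 2: a word is a run of letters, digits or underscores.
# Punctuation is dropped and hyphenated forms are split apart.
by_alnum = re.findall(r'\w+', sentence)

# Definition 3: like definition 2, but internal hyphens keep a word together.
by_hyphen_aware = re.findall(r'\w+(?:-\w+)*', sentence)

for name, tokens in [('whitespace', by_whitespace),
                     ('alnum', by_alnum),
                     ('hyphen-aware', by_hyphen_aware)]:
    print(name, len(tokens), tokens)
```

Whether "Gwaha-ju" is one word or two, and whether "wine." equals "wine", decides what you can later look up in WordNet.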
It would then replace some words with synonyms,
Not sure which words you would like to replace with synonyms, or which synonyms you're going to choose from. But do note that WordNet is not exactly a good thesaurus.
Each word comes with several meanings, and WordNet links meanings (synsets), not words; see https://stackoverflow.com/a/19383914/610569
E.g. given the word "wine":
from nltk.corpus import wordnet as wn
for synset in wn.synsets('wine'):  # each meaning of the word, aka. synset
    print(synset)
    print('Words with same meaning:', synset.lemma_names(), '\n')
How do you know which synset/meaning to use?
That's an open question. It's also known as the Word Sense Disambiguation (WSD) task.
If you just flatten and use the lemma names of all synsets, the "synonyms" or replacements you want to make won't make sense. E.g.
from itertools import chain
from nltk.corpus import wordnet as wn
from nltk import sent_tokenize, word_tokenize
text = """Gwaha-ju (과하주; 過夏酒; literally "summer-passing wine") is a traditional Korean fortified rice wine. The refined rice wine cheongju (also called yakju) is fortified by adding the distilled spirit soju to produce gwaha-ju. Gwaha-baekju was first mentioned in Sanga Yorok, a mid-15th century cookbook, but the rice wine was made without fortification. The earliest recorded recipe for fortified gangha-ju appears in Eumsik dimibang, a 1670 cookbook. Other Joseon books that mention the fortified rice wine include Jubangmun, Chisaeng yoram, Yeokjubangmun, Eumsikbo, Sallim gyeongje, Jeungbo sallim gyeongje, Gyuhap chongseo, and Imwon gyeongjeji."""
for sent in sent_tokenize(text):
    if any(ch.isdigit() for ch in sent):
        for word in word_tokenize(sent):
            # Pool the lemma names of every synset (meaning) of the word.
            lemma_names = set(chain.from_iterable(
                synset.lemma_names() for synset in wn.synsets(word)))
            # If you just flatten and use the lemma names of all synsets,
            # the "synonyms" or replacements you want to make won't make sense.
            print(word, '\t', lemma_names)
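On the question of which meaning to pick: one classic baseline is the simplified Lesk idea, i.e. choose the sense whose definition shares the most words with the surrounding sentence. NLTK ships an implementation as nltk.wsd.lesk. The sketch below is a toy version of that idea using only the standard library. The SENSES inventory, its sense names and its glosses are made up for the demo and are not WordNet's, and toy_lesk is a hypothetical helper.

```python
import re

# Hypothetical mini sense inventory; these glosses are made up for the demo
# and are NOT WordNet's actual definitions.
SENSES = {
    'wine': {
        'wine.drink': 'fermented juice of grapes or rice served as a drink',
        'wine.color': 'a dark red color like that of red wine',
    },
}

STOPWORDS = {'a', 'an', 'the', 'of', 'or', 'as', 'is', 'that', 'like', 'in', 'to'}

def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r'\w+', text.lower()) if w not in STOPWORDS}

def toy_lesk(context, word):
    """Pick the sense whose gloss shares the most content words with the context.

    Ties go to the sense listed first; unknown words return None.
    """
    context_set = content_words(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES.get(word, {}).items():
        overlap = len(context_set & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(toy_lesk("She wore a dark red dress the color of wine", 'wine'))
print(toy_lesk("The rice wine is served as a drink with juice", 'wine'))
```

Word overlap is a weak signal, so don't expect high accuracy from this kind of baseline. But it does show the shape of the problem: you choose one synset first, and only then take lemma names from that synset.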
... and get rid of useless adjectives.
Hmm, that'll require yet another NLP process called POS (part-of-speech) tagging, and it's not perfect.
Perhaps you can try nltk.pos_tag, but don't expect too much of it (in terms of accuracy), e.g.
from itertools import chain
from nltk.corpus import wordnet as wn
from nltk import sent_tokenize, word_tokenize, pos_tag
text = """Gwaha-ju (과하주; 過夏酒; literally "summer-passing wine") is a traditional Korean fortified rice wine. The refined rice wine cheongju (also called yakju) is fortified by adding the distilled spirit soju to produce gwaha-ju. Gwaha-baekju was first mentioned in Sanga Yorok, a mid-15th century cookbook, but the rice wine was made without fortification. The earliest recorded recipe for fortified gangha-ju appears in Eumsik dimibang, a 1670 cookbook. Other Joseon books that mention the fortified rice wine include Jubangmun, Chisaeng yoram, Yeokjubangmun, Eumsikbo, Sallim gyeongje, Jeungbo sallim gyeongje, Gyuhap chongseo, and Imwon gyeongjeji."""
for sent in sent_tokenize(text):
    if any(ch.isdigit() for ch in sent):
        for word, tag in pos_tag(word_tokenize(sent)):
            if not tag.startswith('JJ'):  # JJ, JJR and JJS are the adjective tags.
                print(word)
        print('-----')
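You'd also have to decide what makes an adjective "useless". Here's a sketch of what blindly dropping every JJ* token does to a sentence. The (word, tag) pairs are hand-written in Penn Treebank style for illustration; they are not recorded output from pos_tag, and drop_adjectives is a hypothetical helper.

```python
# Hand-written (word, tag) pairs in Penn Treebank style, for illustration only;
# they are not the recorded output of nltk.pos_tag.
tagged = [
    ('The', 'DT'), ('earliest', 'JJS'), ('recorded', 'JJ'), ('recipe', 'NN'),
    ('for', 'IN'), ('fortified', 'JJ'), ('rice', 'NN'), ('wine', 'NN'),
    ('appears', 'VBZ'), ('in', 'IN'), ('a', 'DT'), ('1670', 'CD'),
    ('cookbook', 'NN'), ('.', '.'),
]

def drop_adjectives(tagged_tokens):
    """Keep every token whose tag is not JJ, JJR or JJS."""
    return [word for word, tag in tagged_tokens if not tag.startswith('JJ')]

kept = drop_adjectives(tagged)
print(' '.join(kept))
```

With these tags, "earliest" and "fortified" both disappear, and those are arguably the two most informative words in the sentence. So "remove the adjectives" and "remove the useless words" are not the same task.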
I am know the generic stuff with python, but I am new to nltk and WordNet. I've started a prototype program that will replace words in a sentence with all the random synonyms,
Keep it up, and don't be discouraged! That said, I think starting with the goal of building an application may not be the right place to start with NLP; try instead:
however I keep getting an error that says there is something wrong with WordNet. I think I installed it right, but I might be wrong.
Yes, there's nothing wrong with the installation.
Perhaps going through the WordNet API in NLTK would help you to understand how and what WordNet can do: http://www.nltk.org/howto/wordnet.html
Also, improving basic Python and understanding why the AttributeError is occurring would help a lot =)
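Since the traceback isn't shown in the question, I can't say what triggers it in your code. As a generic illustration only, here's how an AttributeError typically arises (calling a method on an object that doesn't have it) and how to inspect the object to find out why:

```python
words = ['wine', 'soju']

try:
    words.upper()  # a list has no .upper(); only its str items do
except AttributeError as err:
    message = str(err)
    print('AttributeError:', message)

# Inspect what the object really is and what it offers.
print(type(words).__name__)        # the object is a list, not a str
print(hasattr(words, 'upper'))     # the list itself lacks the method
print(hasattr(words[0], 'upper'))  # each string item has it

# The fix is to call the method on the right object.
upper_words = [w.upper() for w in words]
print(upper_words)
```

The same habit (check type() and dir() of whatever you're calling the attribute on) works for NLTK objects as well.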