I am trying to write a program that calculates the ratio of pronouns to proper nouns.
I use regular expressions to find words that start with a capital letter (to match the proper nouns) and to find the pronouns. However, my regex for pronouns does not work well: it matches not only the pronouns but also words that contain the characters of a pronoun, such as "hearing". See the code below:
import re
from pathlib import Path

from nltk.tokenize import wordpunct_tokenize


def pron_propn():
    # Keep asking until both files can be opened.
    while True:
        try:
            file_to_open = Path(input("\nPlease, insert your file path: "))
            dic_to_open = Path(input('\nPlease, insert your dictionary path: '))
            with open(file_to_open, 'r', encoding="utf-8") as f:
                words = wordpunct_tokenize(f.read())
            with open(dic_to_open, 'r', encoding="utf-8") as d:
                dic = wordpunct_tokenize(d.read())
            break
        except FileNotFoundError:
            print("\nFile not found. Better try again")

    # Capitalized or all-caps tokens are treated as proper noun candidates.
    patt = re.compile(r"^[A-Z][a-z]+\b|^[A-Z]+\b")
    c_n = list(filter(patt.match, words))
    # Pronoun pattern in question: it also matches 'hearing' and 'hear'.
    patt2 = re.compile(r"\bhe|she|it+\b")
    pronouns = list(filter(patt2.match, words))

    propn_new = []
    propn = []
    other = []
    pron = []
    for i in words:
        if i in c_n:
            propn.append(i)
        elif i in pronouns:
            pron.append(i)
        else:
            continue

    # Capitalized words found in the dictionary are not counted as proper nouns.
    for j in propn:
        if j not in dic:
            propn_new.append(j)
        else:
            other.append(j)

    print(propn_new)
    print(pron)
    print(len(pron)/len(propn))


pron_propn()
When I print the list of pronouns, I get: ['he', 'he', 'he', 'he', 'hearing', 'he', 'it', 'hear', 'it', 'he', 'it']
But I want a list like: ['he', 'he', 'he', 'he', 'he', 'it', 'it', 'he', 'it']
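A likely cause, sketched here rather than confirmed: in `\bhe|she|it+\b` the `|` has the lowest precedence, so the pattern splits into the three alternatives `\bhe`, `she` and `it+\b`, and `match()` only anchors at the start of each token. Any token that begins with "he" or "she" therefore passes. The sketch below groups the alternatives and uses `fullmatch()` so that the whole token has to be a pronoun. The names `loose`, `strict` and `keep_pronouns` are made up for illustration, and the comparison stays case-sensitive like the original pattern.

```python
import re

# Tokens taken from the output shown above, including the two unwanted ones.
tokens = ['he', 'he', 'he', 'he', 'hearing', 'he', 'it', 'hear', 'it', 'he', 'it']

# Original pattern: the alternation splits into `\bhe`, `she` and `it+\b`,
# so match() accepts any token that merely starts with "he" or "she".
loose = re.compile(r"\bhe|she|it+\b")

# Grouped alternation; fullmatch() requires the entire token to be a pronoun.
strict = re.compile(r"(?:he|she|it)")


def keep_pronouns(words):
    """Return only the tokens that are exactly 'he', 'she' or 'it'."""
    return [w for w in words if strict.fullmatch(w)]


print([w for w in tokens if loose.match(w)])  # still contains 'hearing' and 'hear'
print(keep_pronouns(tokens))
# ['he', 'he', 'he', 'he', 'he', 'it', 'it', 'he', 'it']
```

Since the input is already tokenized, a plain set lookup such as `w in {"he", "she", "it"}` would presumably give the same result without a regex.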
I also want to compute the ratio itself: the number of pronouns found divided by the number of proper nouns.
Can anyone help me capture only the pronouns?
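For the division, a minimal sketch follows, assuming the denominator should be the filtered list `propn_new` (the code above divides by `len(propn)`, which still includes capitalized words found in the dictionary). The helper name `pronoun_ratio` and the sample lists are hypothetical, not taken from my data. It returns `None` instead of raising `ZeroDivisionError` when no proper nouns are found.

```python
def pronoun_ratio(pronouns, proper_nouns):
    """Return len(pronouns) / len(proper_nouns), or None if there are no proper nouns."""
    if not proper_nouns:
        return None
    return len(pronouns) / len(proper_nouns)


pron = ['he', 'he', 'it']        # made-up sample data
propn_new = ['Alice', 'London']  # made-up sample data

print(pronoun_ratio(pron, propn_new))  # 1.5
print(pronoun_ratio(pron, []))         # None
```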