I have a program that displays a frequency list of the words in a tokenized text. Before that, though, I want to do two things: first, detect the proper nouns in the text and append them to one list (Cap_nouns); second, append the nouns that are not in a dictionary to another list (errors).
Later on, I want to build a frequency list for the errors found and another frequency list for the proper nouns found.
My idea for detecting the proper nouns was to find the items that start with a capital letter and append them to that list, but my regular expression for this task does not seem to work.
Can anyone help me with that? My code is below.
from collections import defaultdict
import re
import nltk
from nltk.tokenize import word_tokenize

with open('fr-text.txt') as f:
    freq = word_tokenize(f.read())
with open('Fr-dictionary_Upper_Low.txt') as fr:
    dic = word_tokenize(fr.read())

# regular expression to detect words with apostrophes and words separated by hyphens
pat = re.compile(r".,:;?!-'%|\b(\w'|w’)+\b|\w+(?:-\w+)+|\d+")
reg = list(filter(pat.match, freq))
# regular expression for words that start with a capital letter
patt = re.compile(r"\b^A-Z\b")
c_n = list(filter(patt.match, freq))

d = defaultdict(int)
errors = []  # empty list for the items not found in the dictionary
Cnouns = []  # empty list for the items starting with a capital letter

for w in freq:
    d[w] += 1
    if w in reg:
        continue
    elif w in c_n:
        Cnouns.append(w)
    elif w not in dic:
        errors.append(w)

for w in sorted(d, key=d.get):
    print(w, d[w])
print(errors)
print(Cnouns)
If there is anything else wrong with my code, please let me know.
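For comparison, here is a minimal sketch of how matching capitalized tokens could look. It uses a sample token list rather than the files above; the key point is that a character range such as A-Z must sit inside square brackets (`[A-Z]`), and `re.match` already anchors at the start of the string, so `\b^A-Z\b` matches the literal text "A-Z" at a word boundary rather than any capital letter.

```python
import re

# Hypothetical sample tokens; in the real program these would come from word_tokenize.
tokens = ["Paris", "est", "une", "Ville", "magnifique"]

# '[A-Z]' is a character class for one uppercase ASCII letter;
# re.match tests only at the beginning of each token.
cap_pat = re.compile(r"[A-Z]")
cap_tokens = [t for t in tokens if cap_pat.match(t)]
print(cap_tokens)  # ['Paris', 'Ville']
```

Note that `[A-Z]` covers only unaccented ASCII capitals; for French tokens starting with letters like "É", a string method such as `t[:1].isupper()` would also catch accented capitals.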