I am using Python's re module in a Jupyter Notebook to build a word-and-definition dictionary from a dictionary stored as a plain-text (.txt) file.
Sample data from text file:
ABACINATE\nA*bac"i*nate, v.t. Etym: [LL. abacinatus, p.p. of abacinare; ab off +\nbacinus a basin.]\n\nDefn: To blind by a red-hot metal plate held before the eyes. [R.]\n\nABACINATION\nA*bac`i*na"tion, n.\n\nDefn: The act of abacinating. [R.]\n\n
The pattern I am trying to write should capture the headword (the line in all capital letters), skip the text that follows it up to "Defn:", and then capture the definition.
Desired output:
{'word': 'ABACINATE', 'definition': 'To blind by a red-hot metal plate held before the eyes.'}
{'word': 'ABACINATION', 'definition': 'The act of abacinating.'}
The pattern I have tried so far is:
import re

# all_words is the text of the dictionary file, read in as a single string
pattern = """
(?P<word>[A-Z*]{3,}) #retrieve capital letter word
(\n.*\n\n\Defn:) #ignore all text up until Defn:
(?P<definition>\w*) #retrieve any worded character after Defn:
(.\ ) #end at the full stop and space
"""
for item in re.finditer(pattern, all_words, re.VERBOSE):
    print(item.groupdict())
I'm struggling to deal with the newline characters here. My idea was to isolate the capitalised headword, start at the newline immediately after it, ignore every character up to the two newlines before 'Defn:', and then capture the definition up to the full stop.
Is there a way to deal with newline characters in this way?
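For reference, here is a minimal sketch of one way the approach described above might be made to work. It rests on three observations. In a non-raw string, \n is a real newline character, and re.VERBOSE discards whitespace in the pattern, so a raw string is used to keep \n as a regex escape. The dot does not match newlines unless re.DOTALL is set. And \w* stops at the first space, so a lazy .+? is used for the definition instead. The sample string is copied from the data above. The name entries, the choice of flags, and dropping * from the headword character class are assumptions made for illustration, not the only possible choices.

```python
import re

# Sample text copied from the question; in practice all_words would be the
# contents of the dictionary file read in as one string.
all_words = (
    'ABACINATE\nA*bac"i*nate, v.t. Etym: [LL. abacinatus, p.p. of abacinare; ab off +\n'
    'bacinus a basin.]\n\nDefn: To blind by a red-hot metal plate held before the eyes. [R.]\n\n'
    'ABACINATION\nA*bac`i*na"tion, n.\n\nDefn: The act of abacinating. [R.]\n\n'
)

# Raw string: \n stays a regex escape instead of becoming a literal newline
# that re.VERBOSE would silently ignore.
pattern = r"""
    ^(?P<word>[A-Z]{3,})\n      # headword: a line made up only of 3+ capitals
    .*?                         # lazily skip pronunciation and etymology lines
    \n\nDefn:\s+                # blank line, then the literal marker "Defn:"
    (?P<definition>.+?\.)       # definition, up to the first full stop...
    (?=\s)                      # ...that is followed by whitespace
"""

# MULTILINE lets ^ match at every line start; DOTALL lets . cross newlines.
flags = re.VERBOSE | re.MULTILINE | re.DOTALL

# "entries" is a hypothetical name for the collected results.
entries = [m.groupdict() for m in re.finditer(pattern, all_words, flags)]

for entry in entries:
    print(entry)
```

On the sample data this prints the two dictionaries shown under "Desired output". It is only a sketch: headwords containing spaces or hyphens, entries with no "Defn:" line, and definitions containing abbreviations such as "e.g. " would need a more careful pattern.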