I am currently using Jupyter Notebook and Regex in Python to create a Word and Definition dictionary from a txt format dictionary file.

Sample data from text file: ABACINATE\nA*bac"i*nate, v.t. Etym: [LL. abacinatus, p.p. of abacinare; ab off +\nbacinus a basin.]\n\nDefn: To blind by a red-hot metal plate held before the eyes. [R.]\n\nABACINATION\nA*bac`i*na"tion, n.\n\nDefn: The act of abacinating. [R.]\n\n

The pattern I am trying to create should capture the all-capitals headword as the word, then skip all the text up to the definition.

Desired output

{'word': 'ABACINATE', 'definition': 'To blind by a red-hot metal plate held before the eyes.'}
{'word': 'ABACINATION', 'definition': 'The act of abacinating.'}

The pattern I have already tried is

pattern="""
(?P<word>[A-Z*]{3,}) #retrieve capital letter word
(\n.*\n\n\Defn:) #ignore all text up until Defn:
(?P<definition>\w*) #retrieve any worded character after Defn:
(.\ ) #end at the full stop and space
"""
for item in re.finditer(pattern,all_words,re.VERBOSE):
    print(item.groupdict())

I'm struggling to deal with the newline characters here. I have tried to isolate the capital letter, then start at the newline character immediately after and ignore any character up until the two newline characters before 'Defn:' and retrieve the definition ending at the full stop.

Is there a way to deal with newline characters in this way?

beeslaw
  • I would argue that this is not a good application for regular expressions. You have specific search strings in a specific order. Just grab through a '\n' for the word, then `str.find("Defn:",i)` to search for Defn from that point. – Tim Roberts Apr 15 '21 at 00:28
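The plain-string approach suggested in the comment above could look roughly like the sketch below. It is not from the thread: the function name `parse_entries` is hypothetical, and it assumes every entry starts with the headword on its own line, has exactly one `Defn:` paragraph, and is separated from the next entry by a blank line. It also assumes the definition ends at the first full stop followed by a space, which would cut off definitions containing abbreviations.

```python
def parse_entries(text):
    """Collect word/definition pairs using only str.find and slicing."""
    entries = []
    pos = 0
    while pos < len(text):
        # Assumption: the headword is the first line of each entry.
        line_end = text.find("\n", pos)
        if line_end == -1:
            break
        word = text[pos:line_end].strip()

        # Jump to the definition marker that follows the headword.
        marker = text.find("Defn:", line_end)
        if marker == -1:
            break
        start = marker + len("Defn:")

        # The definition paragraph ends at the next blank line (or end of text).
        block_end = text.find("\n\n", start)
        if block_end == -1:
            block_end = len(text)

        # Cut at the first full stop followed by a space, if there is one.
        stop = text.find(". ", start, block_end)
        end = stop + 1 if stop != -1 else block_end

        # Collapse line breaks and repeated spaces inside the definition.
        definition = " ".join(text[start:end].split())
        entries.append({"word": word, "definition": definition})

        pos = block_end + 2  # skip the blank line before the next entry
    return entries


sample = (
    'ABACINATE\nA*bac"i*nate, v.t. Etym: [LL. abacinatus, p.p. of abacinare; ab off +\n'
    'bacinus a basin.]\n\nDefn: To blind by a red-hot metal plate held before the eyes. [R.]\n\n'
    'ABACINATION\nA*bac`i*na"tion, n.\n\nDefn: The act of abacinating. [R.]\n\n'
)

for entry in parse_entries(sample):
    print(entry)
```

Entries with several definitions, or with text between the definition and the next headword, would need extra handling.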

1 Answer

You mostly had it, just missing a non-greedy match and an expanded set for the characters in the definitions.

import re
all_words = """ABACINATE\nA*bac"i*nate, v.t. Etym: [LL. abacinatus, p.p. of abacinare; ab off +\nbacinus a basin.]\n\nDefn: To blind by a red-hot metal plate held before the eyes. [R.]\n\nABACINATION\nA*bac`i*na"tion, n.\n\nDefn: The act of abacinating. [R.]\n\n"""

pattern="""
(?P<word>[A-Z*]{3,})([\s\S]*?Defn:)(?P<definition>[a-zA-Z -]*)
"""
for item in re.finditer(pattern,all_words,re.VERBOSE):
    print(item.groupdict())

{'word': 'ABACINATE', 'definition': ' To blind by a red-hot metal plate held before the eyes'}
{'word': 'ABACINATION', 'definition': ' The act of abacinating'}

user1717828
  • Thank you for your answer. This works, but does not include definitions that include other full stops, commas and quotations characters. Is there a workaround for the [a-zA-Z -]* part? – beeslaw Apr 15 '21 at 02:01
  • @beeslaw, I mean, there's lots of ways, but without you listing more desired input/output in the question I can only guess what you want. For example, `(?P<word>[A-Z*]{3,})([\s\S]*?Defn:\ )(?P<definition>.+?\.(?= ))` does what you say but I don't know if it does what you want. – user1717828 Apr 15 '21 at 12:02
  • That is perfect. Thank you so much! – beeslaw Apr 15 '21 at 23:35
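For reference, here is a runnable sketch of the idea in the last pattern, written out with `re.VERBOSE` and run against the sample data from the question. It is an adaptation rather than the commenter's exact code: under `re.VERBOSE` an unescaped space in the pattern is ignored, so the space in the lookahead is escaped here, and `$` with `re.MULTILINE` is added (an assumption) so that a definition ending at the end of a line is also accepted.

```python
import re

sample = (
    'ABACINATE\nA*bac"i*nate, v.t. Etym: [LL. abacinatus, p.p. of abacinare; ab off +\n'
    'bacinus a basin.]\n\nDefn: To blind by a red-hot metal plate held before the eyes. [R.]\n\n'
    'ABACINATION\nA*bac`i*na"tion, n.\n\nDefn: The act of abacinating. [R.]\n\n'
)

pattern = re.compile(
    r"""
    (?P<word>[A-Z*]{3,})     # headword: three or more capitals (or asterisks)
    [\s\S]*?Defn:\           # lazily skip everything, newlines included, up to "Defn: "
    (?P<definition>.+?\.)    # shortest run of characters ending in a full stop
    (?=\ |$)                 # the full stop must be followed by a space or end of line
    """,
    re.VERBOSE | re.MULTILINE,
)

entries = [match.groupdict() for match in pattern.finditer(sample)]
for entry in entries:
    print(entry)
```

Because `.` does not match a newline by default, a definition that wraps onto a second line before its first full stop would not be captured by this sketch, and a definition containing an abbreviation followed by a space would be cut short at that point.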