
I have downloaded the following dictionary from Project Gutenberg http://www.gutenberg.org/cache/epub/29765/pg29765.txt (it is 25 MB so if you're on a slow connection avoid clicking the link)

In the file, the keywords I am looking for are in uppercase, for instance HALLUCINATION. Each keyword is followed by some lines dedicated to pronunciation, which are irrelevant to me.

What I want to extract is the definition, indicated by "Defn:", and then print its lines. I have come up with this rather ugly 'solution':

def lookup(search):
    find = search.upper()                   # transforms our search parameter all upper letters
    output = []                             # empty dummy list
    infile = open('webster.txt', 'r')       # opening the webster file for reading
    for line in infile:
        for part in line.split():
            if (find == part):
                for line in infile:
                    if (line.find("Defn:") == 0):  # ugly I know, but my only guess so far
                        output.append(line[6:])
                        print output               # uncertain about how to proceed
                        break

Now this of course only prints the first line that comes right after "Defn:". I am new to manipulating .txt files in Python and therefore clueless about how to proceed. I read the line into a tuple and noticed that there are special newline characters.

So I suppose I want to tell Python to keep reading until it runs out of newline characters, except that the last line of the definition still has to be read as well.

Could someone please point me to useful functions I could use to solve this problem? A minimal example would be appreciated.


Example of desired output:

lookup("hallucinate")

out: To wander; to go astray; to err; to blunder; -- used of mental processes. [R.] Byron.

lookup("hallucination")

out: The perception of objects which have no reality, or of \r\n sensations which have no corresponding external cause, arising from \r\n disorder or the nervous system, as in delirium tremens; delusion.\r\n Hallucinations are always evidence of cerebral derangement and are\r\n common phenomena of insanity. W. A. Hammond.


from text:

HALLUCINATE
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of
hallucinari, alucinari, to wander in mind, talk idly, dream.]

Defn: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.

HALLUCINATION
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.]

1. The act of hallucinating; a wandering of the mind; error; mistake;
a blunder.
This must have been the hallucination of the transcriber. Addison.

2. (Med.)

Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.

HALLUCINATOR
Hal*lu"ci*na`tor, n. Etym: [L.]
Spaced
  • Why not use `urllib` to access the file? – Beginner Oct 20 '14 at 17:12
  • @Beginner, I don't know that function, I only code since 3 weeks in Python :-) But thanks for mentioning it to me, I will have to google it. But accessing the file is not my problem, 'reading' it is. – Spaced Oct 20 '14 at 17:13
  • @Beginner: does OP ask about getting the file? Nope.. – RickyA Oct 20 '14 at 17:13
  • @RickyA : It was a suggestion. Hence you'll see i commented rather than posting it as an answer. Anyways your comment doesnt help in any case – Beginner Oct 20 '14 at 17:15
  • If I understand correctly you want to get a number of lines after you found a certain search term ?? – RickyA Oct 20 '14 at 17:17
  • @RickyA, yes I will also update that into my question for clarification. I am looking for a criterion such that python knows that is to keep on adding line by line into the 'output' until it runs out of new line characters, i.e. until the end of the definition is reached. My current idea is to 'count' new line characters, each new line character would indicate that one additional line has to be read. – Spaced Oct 20 '14 at 17:19
  • You know each line will have a new line char until the final line and python will stop reading the file then? – Padraic Cunningham Oct 20 '14 at 17:21
  • @PadraicCunningham, yes, I suppose so, not sure if you just posted seconds after me or if you're intentionally repeating me to guide me on the right track. – Spaced Oct 20 '14 at 17:23
  • Can you please post an example of the text you try to get here with some of text before and after it. – RickyA Oct 20 '14 at 17:23
  • @Spaced; he is not repeating you, but we need the stop condition here. You say next newline, but my guess is it will be something else, because the text you want probably also will have newlines in them.... – RickyA Oct 20 '14 at 17:27
  • @RickyA yes I believe to understand, I did add two examples above. The text file uses \r\n as new line characters, this is as far as I came. – Spaced Oct 20 '14 at 17:30
  • @Spaced, looking at the file I see `HALLUCINATION` in capitals once then a few paragraphs before `HALLUCINATOR`, do you want all lines from HALLUCINATION up to and not including `HALLUCINATOR`? – Padraic Cunningham Oct 20 '14 at 17:33
  • @PadraicCunningham, yes as in the example I have given in my updated post, I'd like to start reading after "Defn:", (but that's optional) until the end of the definition, that means up to and not including HALLUCINATOR, stopping after W.A. Hammond. – Spaced Oct 20 '14 at 17:35

3 Answers

0

From here I learned an easy way to deal with memory mapped files and use them as if they were strings. Then you can use something like this to get the first definition for a term.

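import mmap

def lookup(search):
    term = search.upper()
    f = open('webster.txt')
    # map the whole file read-only so it can be searched like a string
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # a headword stands alone on its line, right after a blank line
    index = s.find('\r\n\r\n' + term + '\r\n')
    if index == -1:
        return None
    # skip past "Defn:" and the space that follows it
    definition = s.find('Defn:', index) + len('Defn:') + 1
    # the definition ends at the next blank line
    endline = s.find('\r\n\r\n', definition)
    return s[definition:endline]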

print lookup('hallucination')
print lookup('hallucinate')

Assumptions:

  • There is at least one definition per term
  • If there is more than one, only the first is returned
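The code above is written for Python 2, where `mmap` contents behave like `str`. As a rough sketch only, the same idea on Python 3 would have to search for `bytes`. The function name `lookup_mmap` and the `path` parameter are made up for illustration, and the file is assumed to use `\r\n` line endings as described in the question:

```python
import mmap

def lookup_mmap(path, search):
    """Return the first 'Defn:' paragraph after the headword, or None."""
    term = search.upper().encode("utf-8")
    with open(path, "rb") as f:
        # Map the file read-only; on Python 3 the map is searched with bytes.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as s:
            # A headword stands alone on its line, right after a blank line.
            index = s.find(b"\r\n\r\n" + term + b"\r\n")
            if index == -1:
                return None
            start = s.find(b"Defn:", index)
            if start == -1:
                return None
            start += len(b"Defn:") + 1      # skip "Defn:" and the space after it
            end = s.find(b"\r\n\r\n", start)
            if end == -1:                   # definition runs to the end of the file
                end = len(s)
            return s[start:end].decode("utf-8", errors="replace")
```

As with the original, an entry that has no "Defn:" of its own would pick up the first definition of a later entry, so the assumptions listed above still apply.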
dreyescat
  • I will have to read a lot into this to understand it but it looks like a great approach. Is there a way to make the lookups 'unique' ? Meaning that they find the precise word, for instance lookup("vaccination") returns the definition of antivaccination – Spaced Oct 20 '14 at 17:53
  • Assuming all terms come after a double \r\n we can just find the concrete term. See my edit. – dreyescat Oct 20 '14 at 18:00
  • this will also find partial matches – Padraic Cunningham Oct 20 '14 at 18:26
  • @PadraicCunningham Even with the update I did before? I'm explicitly searching a term surrounded by double newline and newline, that is, alone in a single line. Maybe something I'm missing here. Could you give me an example? I tried with *vaccination* and *antivaccination* and it worked. Tried with *gate* and it worked too. – dreyescat Oct 20 '14 at 18:55
  • `lookup('halluc') Out[27]: "e Project Gutenberg EBook of Webster's Unabridged Dictionary, by Various"` – Padraic Cunningham Oct 20 '14 at 18:56
  • @PadraicCunningham Yep, but this is because I'm not checking if the term was found. So I return just the first letters of the file till the first \r\n\r\n. I'll correct it. Thanks. – dreyescat Oct 20 '14 at 19:01
0

Here is a function that returns the first definition:

def lookup(word):
    word_upper = word.upper()
    found_word = False
    found_def = False
    defn = ''
    with open('dict.txt', 'r') as file:
        for line in file:
            l = line.strip()
            if not found_word and l == word_upper:
                found_word = True
            elif found_word and not found_def and l.startswith("Defn:"):
                found_def = True
                defn = l[6:]
            elif found_def and l != '':
                defn += ' ' + l
            elif found_def and l == '':
                return defn
    return False

print lookup('hallucination')

Explanation: There are four different cases we have to consider.

  • We haven't found the word yet. We have to compare the current line to the word we are looking for in uppercase. If they are equal, we found the word.
  • We have found the word, but haven't found the start of the definition. We therefore have to look for a line that starts with Defn:. If we find it, we add the line to the definition (excluding the six characters for "Defn: ").
  • We have already found the start of the definition and the current line is not empty. In that case, we just add the line to the definition.
  • We have already found the start of the definition and the current line is empty. The definition is complete and we return it.

If we found nothing, we return False.

Note: There are certain entries, e.g. CRANE, that have multiple definitions. The above code is not able to handle that. It will just return the first definition. However, it is far from easy to code a perfect solution considering the format of the file.
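For entries with several definitions, one possible direction is to keep scanning after the first definition and collect every "Defn:" paragraph until a different headword shows up. The following is only a sketch in Python 3: `lookup_all` and `is_headword` are hypothetical names, and the headword test (a line with at least one letter and no lowercase letters) is a guess about the file format that has not been checked against the whole dictionary.

```python
def is_headword(line):
    # Assumption: a headword line contains a letter and no lowercase letters.
    return any(c.isalpha() for c in line) and line == line.upper()

def lookup_all(lines, word):
    """Collect every 'Defn:' paragraph of one entry from an iterable of lines."""
    target = word.upper()
    definitions = []
    current = None        # parts of the definition being read, or None
    in_entry = False      # True while we are below the headword we want
    for raw in lines:
        line = raw.strip()
        if current is not None:
            if line:
                current.append(line)                 # definition continues
            else:
                definitions.append(" ".join(current))  # blank line ends it
                current = None
        elif is_headword(line):
            if in_entry and line != target:
                break                                # next entry: we are done
            in_entry = (line == target)
        elif in_entry and line.startswith("Defn:"):
            current = [line[len("Defn:"):].strip()]
    if current is not None:                          # file ended mid-definition
        definitions.append(" ".join(current))
    return definitions
```

It takes any iterable of lines, so an open file object can be passed directly. A headword that is repeated on consecutive entries is treated as part of the same lookup.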

Tim Zimmermann
0

You can split into paragraphs and use the index of the search word and find the first Defn paragraph after:

def find_def(f,word):
    import re
    with open(f) as f:
        lines = f.read() 
        try:
            start = lines.index("{}\r\n".format(word)) # find where our search word is
        except ValueError: 
            return "Cannot find search term" 
        paras = re.split(r"\s+\r\n",lines[start:],10) # split into paragraphs; maxsplit = 10 as no definition spans more paragraphs than that
        for para in paras:
            if para.startswith("Defn:"): # if para startswith Defn: we have what we need
                return para # return the  para

print(find_def("in.txt","HALLUCINATION"))

Using the whole file returns:

In [5]: print find_def("gutt.txt","VACCINATOR")
Defn: One who, or that which, vaccinates.

In [6]: print find_def("gutt.txt","HALLUCINATION")
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.

A slightly shorter version:

def find_def(f,word):
    import re
    with open(f) as f:
        lines = f.read()
        try:
            start = lines.index("{}\r\n".format(word))
        except ValueError:
            return "Cannot find search term"
        defn = lines[start:].index("Defn:")
        return re.split(r"\s+\r\n",lines[start+defn:],1)[0]
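
Because `lines.index("{}\r\n".format(word))` finds the word wherever a line ends with it, a search for VACCINATION can land on ANTIVACCINATION first. One way to require that the headword fills the whole line is a regular expression with `re.MULTILINE`. This is a sketch in Python 3 under the same file-format assumptions; `find_def_exact` and `DEFN_RE` are names invented here, and the function takes the already-read text rather than a filename.

```python
import re

# From "Defn:" up to (not including) the next blank line or the end of the text.
DEFN_RE = re.compile(r"^Defn:[ \t]*(.*?)(?=\r?\n[ \t]*\r?\n|\Z)",
                     re.MULTILINE | re.DOTALL)

def find_def_exact(text, word):
    """Return the first 'Defn:' paragraph after a line that is exactly `word`."""
    # ^ and $ anchor to line boundaries, so partial headwords do not match.
    head_re = re.compile(r"^" + re.escape(word.upper()) + r"\r?$", re.MULTILINE)
    head = head_re.search(text)
    if head is None:
        return None
    defn = DEFN_RE.search(text, head.end())
    if defn is None:
        return None
    # Join the wrapped lines of the definition into a single line.
    return " ".join(defn.group(1).split())
```

Like the versions above, it returns the first "Defn:" found after the headword, even if that definition belongs to a later entry.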
Padraic Cunningham