I have a text file and a Python task that involves reading it:

Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:

  • If Mr. Churchill is in the novel, then include {'Churchill' : 2}
  • If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}

The file is .txt and it contains around 10-15 paragraphs.

Do you have any ideas about how this can be improved? (It gives me an error after some words. I guess the error happens because one of the "Mr." tokens is at the end of a line.)

orig_text= open('emma.txt', encoding = 'UTF-8')
lines= orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
    wordsdirty = line.split()
    try:
        print (wordsdirty[wordsdirty.index('Mr.') + 1])
    except ValueError:
        continue
ThePyGuy

2 Answers

Try this:

import re
text = "When did Mr. Churchill told Mr. James Brown about the fish"
# x[0] is the whole match; x[1] would be only the last repeated group
m = [x[0] for x in re.findall(r'(Mr\.( [A-Z][a-z]*)+)', text)]

You get:

['Mr. Churchill', 'Mr. James Brown']

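To solve the line issue, read the entire file at once instead of splitting it into lines (if a name can start on the line after "Mr.", also replace the literal space in the pattern with \s+ so that the line break matches):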

with open('emma.txt', encoding='UTF-8') as file:
    text = file.read()

Then, to count the occurrences, simply run:

from collections import Counter
Counter(m)

Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
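
Putting these steps together, here is a minimal sketch rather than a definitive solution. The original loop most likely fails because wordsdirty.index('Mr.') + 1 points past the end of the list when "Mr." is the last word on a line, which raises IndexError, not the ValueError being caught. The helper name count_mr_names and the exact pattern are assumptions for illustration: the pattern uses \s+ so a name on the next line still matches, and it treats each capitalized word after "Mr." as part of the name.

```python
import re
from collections import Counter

# Assumed pattern: "Mr." followed by one or more capitalized words,
# separated by any whitespace (spaces or line breaks).
MR_NAME = re.compile(r"Mr\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")

def count_mr_names(text):
    """Count names that follow 'Mr.', even across a line break."""
    counts = Counter()
    for match in MR_NAME.finditer(text):
        # Collapse whitespace runs (including newlines) into single spaces.
        name = " ".join(match.group(1).split())
        counts[name] += 1
    return dict(counts)

sample = "Mr. Churchill met Mr.\nFrank Churchill. Later Mr. Churchill left."
print(count_mr_names(sample))  # {'Churchill': 2, 'Frank Churchill': 1}
```

One caveat with this pattern: a capitalized word that directly follows the name without punctuation (for example the start of a new sentence after a missing period) would be counted as part of the name.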

rudolfovic

This can easily be done using a regex with a capturing group.

Take a look here for reference. In this scenario, you might want to do something like:

import re
from collections import Counter

# retrieve a list of strings that match your regex
matches = re.findall(r"Mr\. ([a-zA-Z]+)", your_entire_file)  # not sure about the regex; it captures only the first word after "Mr. "

# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)

To access the entire file like that, you might want to map it into memory; take a look at this question.
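
As a rough illustration of the memory-mapping idea, the sketch below uses the standard mmap module. The function name count_mr_names_mmap is hypothetical, and the pattern is the single-word one from above with \s+ instead of a space. For a file of a few paragraphs, a plain file.read() is likely enough; mapping mainly helps with very large files.

```python
import mmap
import re
from collections import Counter

def count_mr_names_mmap(path):
    """Count 'Mr. <Name>' matches by searching a memory-mapped file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; note an empty file cannot be mapped.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # A mapped file exposes bytes, so the pattern must be bytes too.
            matches = re.findall(rb"Mr\.\s+([A-Za-z]+)", mm)
    # Decode each captured name back to text before counting.
    return Counter(m.decode("utf-8") for m in matches)
```

Calling count_mr_names_mmap('emma.txt') would then return a Counter keyed by the first word after each "Mr.".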

ozerodb