How to write a regular expression to match pronouns in a text file?

Question

I am trying to write a program to calculate the ratio of pronouns to proper nouns.

I tried to match proper nouns by looking for words that start with a capital letter, and to match pronouns with a regular expression. However, my regular expression for pronouns does not work well, because it matches not only the pronouns but also words that contain the characters of the pronouns. See the code below:

import re
from pathlib import Path

from nltk.tokenize import wordpunct_tokenize


def pron_propn():

    while True:
        try:
            file_to_open =Path(input("\nPlease, insert your file path: "))
            dic_to_open=Path(input('\nPlease, insert your dictionary path: '))
            with open(file_to_open,'r', encoding="utf-8") as f:
                words = wordpunct_tokenize(f.read())
            with open(dic_to_open,'r', encoding="utf-8") as d:
                dic = wordpunct_tokenize(d.read())
                break         
        except FileNotFoundError:
            print("\nFile not found. Better try again")


    patt=re.compile(r"^[A-Z][a-z]+\b|^[A-Z]+\b")
    c_n= list(filter(patt.match, words))

    patt2=re.compile(r"\bhe|she|it+\b")
    pronouns= list(filter(patt2.match, words))


    propn_new=[]
    propn=[]
    other=[]
    pron=[] 

    for i in words:
        if i in c_n:
            propn.append(i)
        elif i in pronouns:
            pron.append(i)

        else:
            continue

    for j in propn:
        if j not in dic:
           propn_new.append(j)   
        else:
            other.append(j)


    print(propn_new)
    print(pron)
    print(len(pron)/len(propn))


pron_propn()

When I print the list of pronouns, I get: ['he', 'he', 'he', 'he', 'hearing', 'he', 'it', 'hear', 'it', 'he', 'it']

But I want a list like: ['he', 'he', 'he', 'he', 'he', 'it', 'it', 'he', 'it']

I also want the result of the division: the number of pronouns found divided by the number of proper nouns.

Can anyone help me capture only the pronouns?

What is `m`? Your code can't possibly work, not even well-enough to get to division by zero. Please make sure what you posted correctly reflects the problem code. — Amadan, May 29 '19 at 11:52
Did you mean to iterate words? Like `for i in words.split():`? Note that your text has no `PRON nsubj`, it has `PROPN nsubj`, and this is a two word combination, `words.split()` won't work. Please clarify what you are doing. — Wiktor Stribiżew, May 29 '19 at 13:30
@WiktorStribiżew. I've updated my questions and code. The problem now is with the regular expression used to match pronouns only. — Natalia Resende, May 29 '19 at 17:29
`patt2=re.compile(r"\bhe|she|it+\b")` is wrong, use `patt2=re.compile(r"\b(?:s?he|it)\b")` — Wiktor Stribiżew, May 29 '19 at 17:34

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

We can put the desired pronouns in a group and wrap it in word boundaries, with an expression similar to:

(\b(s?he|it)\b)

If we wish, we can add more constraints.
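
To see why the grouping matters, here is a minimal sketch that compares the pattern from the question with the grouped one suggested in the comments. The token list is made up for illustration, and `re.match` is used because that is what `filter(patt2.match, words)` calls in the question.

```python
import re

# Hypothetical tokens, similar to what wordpunct_tokenize would return.
tokens = ["he", "hearing", "she", "shelf", "it", "item", "hear", "The"]

# Pattern from the question. Alternation binds loosest, so this reads as
# (\bhe) OR (she) OR (it+\b): only the first branch has a leading \b and
# only the last branch has a trailing \b.
loose = re.compile(r"\bhe|she|it+\b")

# Grouped pattern: both word boundaries apply to every alternative.
strict = re.compile(r"\b(?:s?he|it)\b")

# match() anchors at the start of each token, so a token is kept whenever
# some branch matches a prefix of it.
loose_hits = [t for t in tokens if loose.match(t)]
strict_hits = [t for t in tokens if strict.match(t)]

print(loose_hits)   # includes 'hearing', 'shelf' and 'hear'
print(strict_hits)  # only the whole-word pronouns
```

Since each token is already a single word, `re.fullmatch` with `(?:s?he|it)` should give the same result without needing `\b` at all.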

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(\b(s?he|it)\b)"

test_str = "Anything she wish before it. Anything he wish after it. Then, we repeat. Anything she wish before it. Anything he wish after it. Then, we repeat. Anything she wish before it. Anything he wish after it. Then, we repeat. Anything she wish before it. Anything he wish after it. Then, we repeat. Anything she wish before it. Anything he wish after it. Then, we repeat. "

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Then we can script the rest: count the pronouns, count the proper nouns, and divide the two counts to get the ratio.
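
A minimal sketch of that last step, under a few assumptions: tokens come from `re.findall(r"\w+", ...)` instead of NLTK so the snippet has no dependencies, proper nouns are approximated as capitalised tokens (as in the question), and the hypothetical `common_words` set stands in for the dictionary file. The function name `pronoun_propn_ratio` is made up for this example.

```python
import re

# Pronouns as whole tokens; IGNORECASE also catches sentence-initial "She"/"It".
PRONOUN = re.compile(r"(?:s?he|it)", re.IGNORECASE)
# Capitalised word or all-caps word, mirroring the question's first pattern.
CAPITALISED = re.compile(r"[A-Z][a-z]+|[A-Z]+")


def pronoun_propn_ratio(text, common_words):
    """Return (pronouns, proper_nouns, ratio); ratio is None if no proper nouns."""
    tokens = re.findall(r"\w+", text)
    pronouns = [t for t in tokens if PRONOUN.fullmatch(t)]
    proper = [
        t for t in tokens
        if CAPITALISED.fullmatch(t)
        and not PRONOUN.fullmatch(t)          # skip "She", "It", ...
        and t.lower() not in common_words     # skip ordinary capitalised words
    ]
    # Guard against dividing by zero when no proper noun was found.
    ratio = len(pronouns) / len(proper) if proper else None
    return pronouns, proper, ratio


sample = "Anna said she saw it. Then Bob heard that he hit the shelf."
pronouns, proper, ratio = pronoun_propn_ratio(sample, {"then", "the"})
print(pronouns)  # ['she', 'it', 'he']
print(proper)    # ['Anna', 'Bob']
print(ratio)     # 1.5
```

Treating every capitalised token as a proper noun is only a rough heuristic; a part-of-speech tagger would likely be more reliable on real text.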

RegEx Circuit

jex.im visualizes regular expressions:
