I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages using Python and some pattern matching. Some example sentences are:

  • "Mr Bond, 67, is an engineer in the UK"
  • "Amanda B. Bynes, 34, is an actress"
  • "Peter Parker (45) will be our next administrator"
  • "Mr. Dylan is 46 years old."
  • "Steve Jones, Age: 32,"

These are some of the patterns I have identified in the dataset. There are other patterns that I have not run into yet, and I am not sure how to find them. I wrote the following code, which works pretty well but is inefficient, so it will take too long to run on the whole dataset.

#Create a search list of expressions that might come right before an age instance
age_search_list = [" " + last_name.lower().strip() + ", age ",
" " + clean_sec_last_name.lower().strip() + " age ",
last_name.lower().strip() + " age ",
full_name.lower().strip() + ", age ",
full_name.lower().strip() + ", ",
" " + last_name.lower() + ", ",
" " + last_name.lower().strip()  + r" \(",
" " + last_name.lower().strip()  + " is "]

#for each element in our search list
for element in age_search_list:
    print("Searching: ",element)

    # retrieve all the instances where we might have an age
    for age_biography_instance in re.finditer(element,souptext.lower()):

        #extract the next four characters, starting right after the match
        #(using .end() avoids an off-by-one when the pattern contains an escape such as "\(")
        age_instance_start = age_biography_instance.end()
        age_instance_end = age_instance_start + 4
        age_string = souptext[age_instance_start:age_instance_end]

        #extract what should be the age
        potential_age = age_string[:-2]

        #extract the next two characters as a security check (i.e. age should be followed by comma, or dot, etc.)
        age_security_check = age_string[-2:]
        age_security_check_list = [", ",". ",") "," y"]

        if age_security_check in age_security_check_list:
            print("Potential age instance found for ",full_name,": ",potential_age)

            #check that what we extracted is an age, convert it to birth year
            try:
                potential_age = int(potential_age)
                print("Potential age detected: ",potential_age)
                if 18 < int(potential_age) < 100:
                    sec_birth_year = int(filing_year) - int(potential_age)
                    print("Filing year was: ",filing_year)
                    print("Estimated birth year for ",clean_sec_full_name,": ",sec_birth_year)
                    #Now, we save it in the main dataframe
                    new_sec_parser = pd.DataFrame([[clean_sec_full_name,"0","0",sec_birth_year,""]],columns = ['Name','Male','Female','Birth','Suffix'])
                    df_sec_parser = pd.concat([df_sec_parser,new_sec_parser])

            except ValueError:
                print("Problem with extracted age ",potential_age)

I have a few questions:

  • Is there a more efficient way to extract this information?
  • Should I use a regex instead?
  • My text documents are very long and I have lots of them. Can I do one search for all the items at once?
  • What would be a strategy to detect other patterns in the dataset?

Some sentences extracted from the dataset:

  • "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"
  • "George F. Rubin(14)(15) Age 68 Trustee since: 1997."
  • "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006"
  • "Mr. Lovallo, 47, was appointed Treasurer in 2011."
  • "Mr. Charles Baker, 79, is a business advisor to biotechnology companies."
  • "Mr. Botein, age 43, has been a member of our Board since our formation."
Has QUIT--Anony-Mousse
user1029296
  • Do these short biographies contain any numbers other than ages? – Rahul Agarwal Aug 07 '19 at 13:07
  • Yes, they do. They contain financial information that can be number of shares, amounts of money, etc. – user1029296 Aug 07 '19 at 13:11
  • So, do these other numbers have a fixed format like money would always have a dollar or pound symbol etc. ? – Rahul Agarwal Aug 07 '19 at 13:13
  • Yes, these are SEC filings, so they have a format. The only two-digit numbers that are not ages should be percentages. – user1029296 Aug 07 '19 at 13:19
  • So your strategy should be to take a paragraph and remove all the other numbers that come in specific formats. Then you are simply left with the age. If you can provide a short biography example, I can give the code as well. – Rahul Agarwal Aug 07 '19 at 13:21
  • Here is an example my system did not pick: George F. Rubin(14)(15) Age 68 Trustee since: 1997 858,600 (16) 1.5 % Vice Chairman of PREIT since 2004. – user1029296 Aug 07 '19 at 13:24
  • I see there are a lot of examples. It would really help folks here if you updated your question with 4-5 sentences of different sorts; that will help in finding the best possible solution. – Rahul Agarwal Aug 07 '19 at 13:28
  • Thanks a lot. Working on extracting sentences now. I added two to the list. – user1029296 Aug 07 '19 at 13:28
  • @user1029296 I have created a solution that will work for all of your examples. – Sheshank S. Aug 07 '19 at 14:22
  • Beware of context: "Rohit Sharma (648) remains the highest run scorer in World Cup 2019". Impressive feat, given an age of 648 years, isn't it? – Has QUIT--Anony-Mousse Aug 07 '19 at 20:15

6 Answers

5

Since your text has to be processed, and not only pattern matched, the correct approach is to use one of the many NLP tools available out there.

What you want is Named Entity Recognition (NER), which is usually done with machine learning models. NER attempts to recognize a predefined set of entity types in text, such as locations, dates, organizations, and person names.

While not 100% accurate, this is much more accurate than simple pattern matching (especially for English), since it relies on information beyond surface patterns, such as part-of-speech (POS) tags and dependency parsing.

Take a look at the results I obtained for the phrases you provided using the AllenNLP online tool (with the fine-grained NER model):

  • "Mr Bond, 67, is an engineer in the UK"
  • "Amanda B. Bynes, 34, is an actress"
  • "Peter Parker (45) will be our next administrator"
  • "Mr. Dylan is 46 years old."
  • "Steve Jones, Age: 32,"

[Images: AllenNLP demo output for each of the five sentences, with the recognized entities highlighted]

Notice that the last one is wrong. As I said, it is not 100% accurate, but it is easy to use.

The big advantage of this approach: you don't have to make a special pattern for every one of the millions of possibilities available.

The best thing: you can integrate it into your Python code:

pip install allennlp

And:

from allennlp.predictors import Predictor
al = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/fine-grained-ner-model-elmo-2018.12.21.tar.gz")
al.predict("Your sentence with date here")

Then, look at the resulting dict for "Date" Entities.

The same goes for spaCy:

!python3 -m spacy download en_core_web_lg
import spacy
sp_lg = spacy.load('en_core_web_lg')
{(ent.text.strip(), ent.label_) for ent in sp_lg("Your sentence with date here").ents}

(However, I have had some bad predictions with it, even though it is generally considered better.)

For more info, read this interesting article at Medium: https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b
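
Since these models report ages under the DATE label rather than as a separate entity type, some post-filtering is needed to keep only the entities that look like ages. Below is a minimal sketch of that step, not a definitive implementation. It assumes the entities arrive as (text, label) pairs, the shape produced by the spaCy snippet above. The function name `ages_from_entities`, the pattern, and the sample pairs are hypothetical (the pairs are not real model output), and the 18 to 100 window is borrowed from the question's code.

```python
import re

# Assumption: an age entity is a bare 1-3 digit number, optionally prefixed by
# "age"/"aged" and optionally followed by "years" or "years old".
AGE_IN_ENTITY = re.compile(
    r"^(?:aged?\s*:?\s*)?(\d{1,3})(?:\s*-?\s*years?(?:\s*-?\s*old)?)?$",
    re.IGNORECASE,
)

def ages_from_entities(entities, low=18, high=100):
    """Return plausible ages found among DATE-labelled (text, label) pairs."""
    ages = []
    for text, label in entities:
        if label != "DATE":
            continue
        match = AGE_IN_ENTITY.match(text.strip())
        # Keep the number only if it falls inside the plausibility window.
        if match and low < int(match.group(1)) < high:
            ages.append(int(match.group(1)))
    return ages

# Hypothetical entity pairs, written by hand for illustration.
sample = [("Mr. Dylan", "PERSON"), ("46 years old", "DATE"), ("2011", "DATE"), ("67", "DATE")]
print(ages_from_entities(sample))  # [46, 67]
```

Four-digit years such as "2011" are rejected because the pattern is anchored at both ends and allows at most three digits.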

duncte123
Tiago Duque
  • IMHO none of those examples is classified correctly as the target expressions are not dates, but ages. Dates also include expressions like "01.09.2001", "on Thursday 12th" and "yesterday" etc. which can generally be placed on a timeline. "47 years old" is clearly not the same kind of expression and should be distinguished from dates. So some (e.g. pattern-based) post-processing would be required to reclassify those DATEs as AGEs. – ongenz Aug 07 '19 at 18:33
  • @ongenz That is a noteworthy opinion. This is probably due to entity label limitations: the model was trained to identify ages as dates. It has to do with granularity and is part of a trade-off: you want better results? Okay, let us generalize more with the amount of data... However, isn't it easier to extract a single pattern (or maybe 3) rather than thousands of distinct number patterns? Also, it depends on the corpus used; maybe no dates are present. He could also check the closest date to a Person entity. – Tiago Duque Aug 07 '19 at 18:49
  • yes I would have gone for a simple token-based pattern matching approach rather than a corpus-based NER model to begin with. But seeing as an answer was provided, my suggestion was intended to expand on it. – ongenz Aug 08 '19 at 09:57
1
import re 

x =["Mr Bond, 67, is an engineer in the UK"
,"Amanda B. Bynes, 34, is an actress"
,"Peter Parker (45) will be our next administrator"
,"Mr. Dylan is 46 years old."
,"Steve Jones, Age:32,"]

[re.findall(r'\d{1,3}', i)[0] for i in x] # ['67', '34', '45', '46', '32']
ComplicatedPhenomenon
  • I think he said that there will be percentages and money value as well, and this regex would pick that up as well – Sheshank S. Aug 07 '19 at 14:10
1

This will work for all the cases you provided: https://repl.it/repls/NotableAncientBackground

import re

# renamed from `input` to avoid shadowing the built-in function
sentences = ["Mr Bond, 67, is an engineer in the UK",
    "Amanda B. Bynes, 34, is an actress",
    "Peter Parker (45) will be our next administrator",
    "Mr. Dylan is 46 years old.",
    "Steve Jones, Age:32,",
    "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
    "George F. Rubin(14)(15) Age 68 Trustee since: 1997.",
    "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006",
    "Mr. Lovallo, 47, was appointed Treasurer in 2011.",
    "Mr. Charles Baker, 79, is a business advisor to biotechnology companies.",
    "Mr. Botein, age 43, has been a member of our Board since our formation."]
for sentence in sentences:
    # number right after an explicit "Age:" or "Age " label
    age = re.findall(r'Age[:\s](\d{1,3})', sentence)
    # 1-3 digit number between spaces, optionally followed by a comma
    age.extend(re.findall(r' (\d{1,3}),? ', sentence))
    # fall back to a number in parentheses only if nothing else matched
    if len(age) == 0:
        age = re.findall(r'\((\d{1,3})\)', sentence)
    print(sentence + " --- AGE: " + str(set(age)))

Returns

Mr Bond, 67, is an engineer in the UK --- AGE: {'67'}
Amanda B. Bynes, 34, is an actress --- AGE: {'34'}
Peter Parker (45) will be our next administrator --- AGE: {'45'}
Mr. Dylan is 46 years old. --- AGE: {'46'}
Steve Jones, Age:32, --- AGE: {'32'}
Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation --- AGE: set()
George F. Rubin(14)(15) Age 68 Trustee since: 1997. --- AGE: {'68'}
INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006 --- AGE: {'56'}
Mr. Lovallo, 47, was appointed Treasurer in 2011. --- AGE: {'47'}
Mr. Charles Baker, 79, is a business advisor to biotechnology companies. --- AGE: {'79'}
Mr. Botein, age 43, has been a member of our Board since our formation. --- AGE: {'43'}
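
On the question of doing one search for everything at once: the separate searches can probably be folded into a single precompiled pattern, so that each document is scanned only once. The following is a sketch under stated assumptions, not a tested solution for the full dataset. The branch names, the function name `find_ages`, and the rule that parenthesised numbers count only as a fallback are my own choices for illustration, and the 18 to 100 window is taken from the question's code.

```python
import re

# One alternation, compiled once. Each branch captures the number in its own
# named group so we can tell afterwards which rule fired.
AGE_PATTERN = re.compile(
    r"""
      \baged?\s*:?\s*(?P<labelled>\d{1,3})\b        # Age: 32 / age 43 / Age 68
    | (?<!\d),\s*(?P<appositive>\d{1,3})\s*,        # , 67,  (but not 1,045,000)
    | \((?P<parenthesised>\d{1,3})\)                # (45)
    | \b(?P<stated>\d{1,3})\s+years?\s+old\b        # 46 years old
    """,
    re.IGNORECASE | re.VERBOSE,
)

def find_ages(text, low=18, high=100):
    """Scan text once and return plausible ages as a list of ints."""
    primary, fallback = [], []
    for match in AGE_PATTERN.finditer(text):
        kind = match.lastgroup            # name of the branch that matched
        age = int(match.group(kind))
        if not low < age < high:
            continue
        # Parenthesised numbers are often footnote markers, so they are
        # used only when no other branch produced a candidate.
        (fallback if kind == "parenthesised" else primary).append(age)
    return primary or fallback

print(find_ages("George F. Rubin(14)(15) Age 68 Trustee since: 1997."))  # [68]
print(find_ages("Peter Parker (45) will be our next administrator"))     # [45]
```

Compiling once and calling `finditer` over the whole document avoids rebuilding and rerunning a pattern for every name. The price is that matches are no longer tied to a specific person, so this suits per-sentence or per-biography text better than a whole filing.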
Sheshank S.
0

A simple way to find a person's age in your sentences is to extract a two-digit number:

import re

sentence = 'Steve Jones, Age: 32,'
print(re.findall(r"\b\d{2}\b", sentence)[0])

# output: 32

If you do not want to match a number followed by % (a percentage), and want the number to start at a word boundary, you could do:

sentence = 'Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation'

match = re.findall(r"\b\d{2}(?!%)[^\d]", sentence)

if match:
    # reuse the earlier result; drop the trailing non-digit character
    print(match[0][:2])
else:
    print('no match')

# output: no match

This also works for the previous sentence.

Gustav Rasmussen
kederrac
0

Judging by the examples you have given, here is the strategy I propose:

Step 1:

Check whether the sentence contains the word "Age" followed by a number. Regex: (?i)(Age).*?(\d+)

The above will take care of examples like this:

-- George F. Rubin(14)(15) age 68 Trustee since: 1997.

-- Steve Jones, Age: 32

Step 2:

-- Check whether a "%" sign is in the sentence; if so, remove the number attached to it

-- If "Age" is not in the sentence, remove all 4-digit numbers. Example regex: \b\d{4}\b

-- Any digits that remain in the sentence should be the age

Examples that get covered will be like:

-- "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation" -- No numbers will be left

--"INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006" -- Only 56 will be left

-- "Mr. Lovallo, 47, was appointed Treasurer in 2011." -- only 47 will be left

This may not be a complete answer, since your data can contain other patterns. But you asked for a strategy, and this one works for all the examples you posted.
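
The two steps above could be sketched roughly as follows. This is only an illustration of the strategy, with a few assumptions of mine: the function name `extract_age` is hypothetical, the keyword regex uses word boundaries (`\bage\b`) so that words like "manager" or "percentage" do not trigger step 1, and when several numbers survive step 2 the first one is returned.

```python
import re

AGE_KEYWORD = re.compile(r"\bage\b.*?(\d+)", re.IGNORECASE)  # Step 1
PERCENTAGE = re.compile(r"\d+(?:\.\d+)?\s*%")                # Step 2: numbers with a % sign
FOUR_DIGITS = re.compile(r"\b\d{4}\b")                       # Step 2: years such as 2010
ANY_NUMBER = re.compile(r"\d+")

def extract_age(sentence):
    """Return the age as a string, or None if no candidate is left."""
    # Step 1: an explicit "Age" keyword followed by a number wins.
    keyword_match = AGE_KEYWORD.search(sentence)
    if keyword_match:
        return keyword_match.group(1)
    # Step 2: strip percentages and 4-digit numbers, then take what remains.
    cleaned = PERCENTAGE.sub(" ", sentence)
    cleaned = FOUR_DIGITS.sub(" ", cleaned)
    remaining = ANY_NUMBER.findall(cleaned)
    return remaining[0] if remaining else None

print(extract_age("George F. Rubin(14)(15) age 68 Trustee since: 1997."))  # 68
print(extract_age("Mr. Lovallo, 47, was appointed Treasurer in 2011."))    # 47
```

Note that without the "Age" keyword, a footnote marker such as (14) would be returned as the age, so a plausibility check on the result is probably still worth adding.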

Rahul Agarwal
0

Instead of regex, you could also use spaCy's rule-based Matcher. The patterns below should work, though you may need to add more constraints to avoid picking up percentages and money values.

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher 

age_patterns = [
# e.g Steve Jones, Age: 32,
[{"LOWER": "aged"}, {"IS_PUNCT": True,"OP":"?"},{"LIKE_NUM": True}],
[{"LOWER": "age"}, {"IS_PUNCT": True,"OP":"?"}, {"LIKE_NUM": True}],
# e.g "Peter Parker (45) will be our next administrator" OR "Amanda B. Bynes, 34, is an actress"
[{'POS':'PROPN'},{"IS_PUNCT": True}, {"LIKE_NUM": True}, {"IS_PUNCT": True}],
# e.g "Mr. Dylan is 46 years old."
[{"LIKE_NUM": True},{"IS_PUNCT": True,"OP":"*"},{"LEMMA": "year"}, {"IS_PUNCT": True,"OP":"*"},
 {"LEMMA": "old"},{"IS_ALPHA": True, "OP":"*"},{'POS':'PROPN',"OP":"*"},{'POS':'PROPN',"OP":"*"}  ]
]

# `text` is the biography string to search
doc = nlp(text)
matcher = Matcher(nlp.vocab) 
matcher.add("matching", age_patterns) 
matches = matcher(doc)

schemes = []
for _, start, end in matches:
    # each match is a (match_id, start, end) tuple of token indices

    # skip a leading determiner
    if doc[start].pos_ == 'DET':
        start = start + 1

    # matched string
    span = str(doc[start:end])

    # if the new span contains the previous one, keep only the longer span
    if schemes and schemes[-1] in span:
        schemes[-1] = span
    else:
        schemes.append(span)
Pelonomi Moiloa