I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages using Python and some pattern matching. Some example sentences are:
- "Mr Bond, 67, is an engineer in the UK"
- "Amanda B. Bynes, 34, is an actress"
- "Peter Parker (45) will be our next administrator"
- "Mr. Dylan is 46 years old."
- "Steve Jones, Age: 32,"
These are some of the patterns I have identified in the dataset. There are other patterns that I have not run into yet, and I am not sure how to find them. I wrote the following code, which works pretty well but is inefficient, so it will take too much time to run on the whole dataset.
    import re
    import pandas as pd

    # Create a search list of expressions that might come right before an age instance
    age_search_list = [
        " " + last_name.lower().strip() + ", age ",
        " " + clean_sec_last_name.lower().strip() + " age ",
        last_name.lower().strip() + " age ",
        full_name.lower().strip() + ", age ",
        full_name.lower().strip() + ", ",
        " " + last_name.lower() + ", ",
        " " + last_name.lower().strip() + " (",
        " " + last_name.lower().strip() + " is ",
    ]

    # Lowercase the document once instead of once per search expression
    souptext_lower = souptext.lower()
    # Two-character sequences that may follow an age (comma, dot, closing parenthesis, " y" of " years")
    age_security_check_list = [", ", ". ", ") ", " y"]

    # For each element in our search list
    for element in age_search_list:
        print("Searching: ", element)
        # Retrieve all the instances where we might have an age;
        # re.escape makes characters such as "." and "(" match literally
        for age_biography_instance in re.finditer(re.escape(element), souptext_lower):
            # Extract the four characters that follow the match
            age_instance_start = age_biography_instance.end()
            age_instance_end = age_instance_start + 4
            age_string = souptext[age_instance_start:age_instance_end]
            # Extract what should be the age (first two characters)
            potential_age = age_string[:-2]
            # Use the last two characters as a security check
            age_security_check = age_string[-2:]
            if age_security_check in age_security_check_list:
                print("Potential age instance found for ", full_name, ": ", potential_age)
                # Check that what we extracted is an age, then convert it to a birth year
                try:
                    potential_age = int(potential_age)
                    print("Potential age detected: ", potential_age)
                    if 18 < potential_age < 100:
                        sec_birth_year = int(filing_year) - potential_age
                        print("Filing year was: ", filing_year)
                        print("Estimated birth year for ", clean_sec_full_name, ": ", sec_birth_year)
                        # Now, we save it in the main dataframe
                        new_sec_parser = pd.DataFrame(
                            [[clean_sec_full_name, "0", "0", sec_birth_year, ""]],
                            columns=['Name', 'Male', 'Female', 'Birth', 'Suffix'],
                        )
                        df_sec_parser = pd.concat([df_sec_parser, new_sec_parser])
                except ValueError:
                    print("Problem with extracted age ", potential_age)
I have a few questions:
- Is there a more efficient way to extract this information?
- Should I use a regex instead?
- My text documents are very long and I have lots of them. Can I do one search for all the items at once?
- What would be a strategy to detect other patterns in the dataset?
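To make the last question concrete, one possible strategy is to look at what follows each name, mask the digits, and count the resulting templates, so that frequent unknown formats stand out. This is only a rough sketch: the function name `context_templates`, the `window` size, and the idea of keeping only contexts with a standalone two-digit number are all assumptions, not something validated on the real filings.

```python
import re
from collections import Counter

def context_templates(texts_and_names, window=12):
    """Count the text templates that follow a name and contain a two-digit number.

    Hypothetical helper: `texts_and_names` is an iterable of (text, last_name) pairs.
    """
    counts = Counter()
    # A standalone two-digit number (not part of a longer number such as a year)
    two_digits = re.compile(r"(?<!\d)\d{2}(?!\d)")
    for text, last_name in texts_and_names:
        name_pattern = r"\b" + re.escape(last_name) + r"(?!\w)"
        for name_match in re.finditer(name_pattern, text, flags=re.IGNORECASE):
            # Look only at a short window right after the name
            context = text[name_match.end():name_match.end() + window]
            number = two_digits.search(context)
            if number:
                # Keep what precedes the number, mask the number itself,
                # and keep the two characters that follow it
                template = (context[:number.start()] + "##"
                            + context[number.end():number.end() + 2])
                counts[template.lower()] += 1
    return counts

samples = [
    ("Mr Bond, 67, is an engineer in the UK", "Bond"),
    ("Mr. Lovallo, 47, was appointed Treasurer in 2011.", "Lovallo"),
    ("Mr. Botein, age 43, has been a member of our Board since our formation.", "Botein"),
    ("Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation", "Love"),
]
for template, count in context_templates(samples).most_common():
    print(repr(template), count)
```

Templates that appear often but are not yet covered by the search list would be candidates for new patterns; rare ones can be inspected by hand.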
Some sentences extracted from the dataset:
- "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"
- "George F. Rubin(14)(15) Age 68 Trustee since: 1997."
- "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006"
- "Mr. Lovallo, 47, was appointed Treasurer in 2011."
- "Mr. Charles Baker, 79, is a business advisor to biotechnology companies."
- "Mr. Botein, age 43, has been a member of our Board since our formation."
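For reference, a single compiled regex could cover all of the formats above in one pass over the text, instead of one search per expression. The sketch below is an assumption built only from the sample sentences in this post: `build_age_pattern` and `extract_age` are hypothetical names, and the alternatives would need to be extended as new formats turn up.

```python
import re

def build_age_pattern(last_name):
    """Compile one regex covering the age formats seen so far (hypothetical helper)."""
    name = re.escape(last_name)  # names may contain regex metacharacters such as "."
    return re.compile(
        r"\b" + name + r"(?!\w)"
        + r"(?:\s*\(\d+\))*"                            # optional footnote markers such as (14)(15)
        + r"(?:"
        + r",\s*(?:age:?\s*)?(?P<comma>\d{2})\s*,"      # "Bond, 67," / "Botein, age 43," / "Jones, Age: 32,"
        + r"|\s*\((?P<paren>\d{2})\)"                   # "Parker (45)"
        + r"|\s+is\s+(?P<is_old>\d{2})\s+years\s+old"   # "Dylan is 46 years old"
        + r"|\s+age:?\s*(?P<bare>\d{2})\b"              # "Rubin(14)(15) Age 68"
        + r")",
        re.IGNORECASE,
    )

def extract_age(text, last_name, min_age=18, max_age=100):
    """Return the first plausible age found after last_name, or None."""
    for match in build_age_pattern(last_name).finditer(text):
        # Exactly one of the named groups is set for any given match
        age = int(next(group for group in match.groups() if group is not None))
        if min_age < age < max_age:
            return age
    return None

samples = [
    ("Peter Parker (45) will be our next administrator", "Parker"),
    ("Mr. Dylan is 46 years old.", "Dylan"),
    ("Steve Jones, Age: 32,", "Jones"),
    ("George F. Rubin(14)(15) Age 68 Trustee since: 1997.", "Rubin"),
    ("INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006", "Nooyi"),
    ("Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation", "Love"),
]
for sentence, last_name in samples:
    print(last_name, extract_age(sentence, last_name))
```

In this sketch the sentence about Mr. Love yields `None`, because neither "2010" nor "48%" fits any of the alternatives, and the footnote markers after "Rubin" are skipped before the age is read.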