How to extract age and gender of the person from unprocessed text/data?

Question

I have a CSV file with a column of texts, and I want to extract the patient's age from each row. I can't just use "isdigit" because there are other digits in the texts as well. How can I do this? Thank you.

EXTRA: I want to extract the genders too. The patient is sometimes referred to as male/female, sometimes as man/woman, and sometimes as gentleman/lady.

Is there a way to write the findall so that, for example, if the text is 17-year-old, it gives me the number only when it is followed by -year-old?

re.findall("[\d].", '-year-old')

Sample of lines from text:

This 23-year-old white female presents with...

...pleasant gentleman who is 42 years old...

...The patient is a 10-1/2-year-old born with...

...A 79-year-old Filipino woman...

Patient, 37,...

How can I get a list of ages and a list of genders?

i.e.:

Age:

    ['23','42','79','37'...]

Gender:

    ['female','male','male','female','male'...]

Your question has already been answered. Check [here](https://stackoverflow.com/questions/57395165/extracting-a-persons-age-from-unstructured-text-in-python) — Darkknight, May 03 '20 at 03:14

score 1 · Answer 1 · answered May 03 '20 at 00:29

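import re

# Raw strings avoid invalid-escape warnings; \d+ requires at least one digit,
# and "years?" matches both "year old" and "years old".
re_list = [
    r'\d+-year-old',
    r'\d+ years? old',
]

matches = []
for r in re_list:
    matches += re.findall(r, 'pleasant gentleman who is 42 years old, This 23-year-old white female presents with')
print(matches)

prints out:

['23-year-old', '42 years old']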
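To get only the number rather than the whole phrase, a capture group can be used. Below is a minimal sketch run on the sample lines from the question. The names `AGE_PATTERNS` and `extract_age` are made up for illustration, and treating "Patient, 37," as an age is an assumption about the data, not a general rule.

```python
import re

# Patterns are tried in order; each one captures only the number.
AGE_PATTERNS = [
    # "23-year-old", "42 years old", "10-1/2-year-old" (keeps the whole-number part)
    re.compile(r"\b(\d{1,3})(?:-\d/\d)?[- ]years?[- ]old\b", re.IGNORECASE),
    # "Patient, 37," : a bare number between commas right after the word "patient"
    re.compile(r"\bpatient,\s*(\d{1,3})\s*,", re.IGNORECASE),
]


def extract_age(text):
    """Return the first age found in text as a string, or None if there is none."""
    for pattern in AGE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


samples = [
    "This 23-year-old white female presents with...",
    "...pleasant gentleman who is 42 years old...",
    "...The patient is a 10-1/2-year-old born with...",
    "...A 79-year-old Filipino woman...",
    "Patient, 37,...",
]

ages = [extract_age(s) for s in samples]
print(ages)  # ['23', '42', '10', '79', '37']
```

Other numbers in the text (dates, room numbers, dosages) are ignored because no pattern matches them, so `extract_age` returns `None` for a row without a recognizable age.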

score 0 · Accepted Answer · answered May 02 '20 at 15:38

You can do that easily using regex (regular expressions).

import re

# example input; replace with the text of one row
your_text = "This 23-year-old white female presents with..."

# returns all numbers in the text
age = re.findall(r"\d+", your_text)

# returns all gender-related words; \b keeps "male" from matching inside "female"
gender = re.findall(r"\b(?:female|male|woman|man|gentleman|lady|girl)\b", your_text)

For the gender part, you can use a dict to map each matched word to the right label:

gender_dict = {"male": ["gentleman", "man", "male"],
               "female": ["female", "woman", "girl", "lady"]}
gender_aux = []
for g in gender:
    if g in gender_dict['male']:
        gender_aux.append('male')
    elif g in gender_dict['female']:
        gender_aux.append('female')
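If each text describes a single patient, one option is to take only the first gender word per text, so repeated mentions yield a single label. A minimal sketch, assuming that one-patient-per-text layout; `GENDER_WORDS` and `extract_gender` are hypothetical names and the word list is only a starting point.

```python
import re

# Hypothetical word-to-label table; extend it to match your data.
GENDER_WORDS = {
    "male": "male", "man": "male", "gentleman": "male", "boy": "male",
    "female": "female", "woman": "female", "lady": "female", "girl": "female",
}

# \b...\b matches whole words only, so "male" is not found inside "female"
# and "man" is not found inside "woman" or "gentleman".
GENDER_RE = re.compile(r"\b(" + "|".join(GENDER_WORDS) + r")\b", re.IGNORECASE)


def extract_gender(text):
    """Return 'male' or 'female' for the first gender word in text, or None."""
    match = GENDER_RE.search(text)
    return GENDER_WORDS[match.group(1).lower()] if match else None


print(extract_gender("This 23-year-old white female presents with..."))  # female
print(extract_gender("he is male, and the male is 25 years"))            # male
print(extract_gender("Patient, 37,..."))                                 # None
```

Returning `None` when nothing matches keeps the gender list the same length as the list of texts, which makes it easier to line up with the ages.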

answered May 02 '20 at 15:38

Gabriel Soares

Thank you, but the point is that there are also other digits in the text, so a plain re.findall would not really work for me. Is there a way to write the findall so that, for example, if the text is **17-year-old**, it gives me the number only when it is followed by -year-old? `re.findall("[\d].", '-year-old')` – leocleo May 02 '20 at 19:52
With 'he is male, and the male is 25 years', male will be appended to the list twice. Should I avoid that by deleting the duplicates, or is there another method? – leocleo May 03 '20 at 15:13
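For the CSV case in the question, one way to keep ages and genders aligned with the rows is to process each row separately and record `None` where nothing is found. This is only a sketch: the column name `text`, the in-memory sample standing in for the real file, and the helper `parse_row` are all assumptions.

```python
import csv
import io
import re

AGE_RE = re.compile(r"\b(\d{1,3})[- ]years?[- ]old\b", re.IGNORECASE)
GENDER_RE = re.compile(r"\b(female|woman|lady|male|man|gentleman)\b", re.IGNORECASE)
FEMALE_WORDS = {"female", "woman", "lady"}


def parse_row(text):
    """Return (age, gender) for one row; either value is None if not found."""
    age = AGE_RE.search(text)
    gender = GENDER_RE.search(text)
    label = None
    if gender:
        label = "female" if gender.group(1).lower() in FEMALE_WORDS else "male"
    return (age.group(1) if age else None, label)


# Stand-in for open("patients.csv", newline=""); the file name is hypothetical.
csv_file = io.StringIO(
    "text\n"
    "This 23-year-old white female presents with chest pain\n"
    "A pleasant gentleman who is 42 years old\n"
    "Follow-up visit in room 12\n"
)

rows = [parse_row(row["text"]) for row in csv.DictReader(csv_file)]
ages = [age for age, _ in rows]
genders = [gender for _, gender in rows]
print(ages)     # ['23', '42', None]
print(genders)  # ['female', 'male', None]
```

Because each row contributes exactly one entry to each list, there are no duplicates to remove afterwards.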
