
I used the Python package "gender-guesser" to detect a person's gender from their name. However, I now want to identify the gender from a sentence that does not contain the person's name.

Suppose I have the below sentence:

"Prior to you with a 14 year old male, who got out of bed and had some sort of syncopal episode."

The sentence is just an example; it contains only the word "male" and not the person's name. However, the input may contain other words such as boy, girl, lady, transgender, guy, woman, man, unknown, etc.

This is what I am currently trying, but it may not be the right approach for the result I want:

# tokens of the original string, e.g. produced by split()
wordlist=tokens

# counters for gendered words
male_count=0
female_count=0

for i in range(len(wordlist)):
  if wordlist[i]==('male' or 'boy' or 'guy' or 'man'):
    print(i)
    male_count= male_count+1
  
  else: 
    if wordlist[i]==('female' or 'girl' or 'lady' or 'woman'):
      female_count= female_count+1

Is there a better way to identify the gender?

Vishal Rana

1 Answer


A few ways to improve:

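  1. Instead of if wordlist[i]==('male' or 'boy' or 'guy' or 'man'), check if wordlist[i] in ['male', 'boy', 'guy', 'man']. The original expression does not do what it appears to: 'male' or 'boy' or 'guy' or 'man' evaluates to 'male', so only that one word is ever matched. The same applies to the female words.
  2. Not a big deal, but instead of a list (i.e., ['male', 'boy', 'guy', 'man']) you can use a set, e.g. {'male', 'boy', 'guy', 'man'}, which makes membership tests faster. The same applies to the female words.
  3. There is no need for the else with a nested if; two separate if statements are enough.
  4. You can use a += 1 instead of a = a + 1, which does the same job.
  5. You don't need to iterate over range(len(wordlist)). You can iterate over wordlist directly.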
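To see why the original comparison in point 1 only ever matches 'male', here is a small check. In Python, `or` returns its first truthy operand, so the parenthesised expression collapses to a single string before `==` is applied. The variable names `candidates` and `male_words` are illustrative only.

```python
# `or` returns the first truthy operand, so this expression
# is really just the string 'male'.
candidates = ('male' or 'boy' or 'guy' or 'man')
print(candidates)            # male

# Consequently 'boy' is never matched by the == comparison...
print('boy' == candidates)   # False

# ...whereas a membership test checks every option.
male_words = {'male', 'boy', 'guy', 'man'}
print('boy' in male_words)   # True
```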

So, your code can be cleaned up a little as follows:

male_count = 0
female_count = 0

male_categories = {'male', 'boy', 'guy', 'man'}
female_categories = {'female', 'girl', 'lady', 'woman'}
for word in wordlist:
    if word in male_categories:
        male_count += 1
    if word in female_categories:
        female_count += 1

There are other ways to do this as well, such as summing the counts of 'male', 'boy', 'guy' and 'man' in the list, which would take one or two lines. But I think the version above is a better starting point and easier to understand.
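For comparison, the shorter counting variant might look like the sketch below. It assumes the sentence is lowercased and split into words with a simple regular expression; the `tokenize` and `count_gender_words` helpers are hypothetical and not part of the original code. Matching whole words matters here, because 'female' contains 'male' and 'woman' contains 'man'.

```python
import re
from collections import Counter

MALE_WORDS = {'male', 'boy', 'guy', 'man'}
FEMALE_WORDS = {'female', 'girl', 'lady', 'woman'}

def tokenize(text):
    """Hypothetical helper: lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def count_gender_words(text):
    """Return (male_count, female_count) for the given sentence."""
    counts = Counter(tokenize(text))  # missing words count as 0
    male_count = sum(counts[w] for w in MALE_WORDS)
    female_count = sum(counts[w] for w in FEMALE_WORDS)
    return male_count, female_count

sentence = ("Prior to you with a 14 year old male, who got out of bed "
            "and had some sort of syncopal episode.")
print(count_gender_words(sentence))  # (1, 0)
```

Note that this is still plain keyword counting, so words such as "transgender" or "unknown" from the question would need their own categories.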

smttsp