I'm working on mining survey data. I was able to flag the rows for certain keywords:

    survey['Rude'] = survey['Comment Text'].str.contains('rude', na=False, regex=True).astype(int)

Now I want to flag any rows containing names. I have another dataframe that contains common US names. Here's what I thought would work, but it is not flagging any rows, even though I have validated that names do exist in the 'Comment Text' column:

    for row in survey:
        for word in survey['Comment Text']:
            survey['Name'] = 0
            if word in names['Name']:
                survey['Name'] = 1

Justin

1 Answer

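You are not looping through the series correctly. for row in survey: loops through the column names in survey, not its rows. for word in survey['Comment Text']: loops through whole comment strings, not individual words. survey['Name'] = 0 sets the entire column to 0 on every pass, and survey['Name'] = 1 would likewise set every row at once. Finally, word in names['Name'] tests the index labels of the Series (0, 1, 2, ...), not its values, so it never matches a name.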
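
The points above can be checked directly. This is a minimal sketch on made-up toy data (the two comments and two names here are illustrative, not from the question's survey), showing what iterating over a DataFrame yields and what `in` tests on a Series:

```python
import pandas as pd

# Toy data for illustration only
survey = pd.DataFrame({'Comment Text': ['Hi Justin', 'no names here']})
names = pd.DataFrame({'Name': ['Justin', 'Susan']})

# Iterating over a DataFrame yields its column labels, not its rows
columns_seen = [row for row in survey]

# `in` on a Series tests the index labels (0, 1, ...), not the values
name_in_series = 'Justin' in names['Name']
index_in_series = 0 in names['Name']

# Testing against a set of the values behaves as intended
name_in_values = 'Justin' in set(names['Name'])

print(columns_seen)      # ['Comment Text']
print(name_in_series)    # False
print(index_in_series)   # True
print(name_in_values)    # True
```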

You could use a set intersection with apply() to avoid writing the row loops yourself:

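    import pandas as pd

    survey = pd.DataFrame({'Comment_Text':['Hi rcriii',
                                           'Hi yourself stranger',
                                           'say hi to Justin for me']})
    names = pd.DataFrame({'Name':['rcriii', 'Justin', 'Susan', 'murgatroyd']})
    # Build the set of names once so each lookup is fast
    s2 = set(names['Name'])

    def is_there_a_name(s):
        # Split the comment on whitespace and compare against the known names
        s1 = set(s.split())
        if len(s1.intersection(s2))>0:
            return 1
        else:
            return 0

    # Apply the check to every comment; the result is a 0/1 flag column
    survey['Name'] = survey['Comment_Text'].apply(is_there_a_name)

    print(names)
    print(survey)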

         Name
0      rcriii
1      Justin
2       Susan
3  murgatroyd
              Comment_Text  Name
0                Hi rcriii     1
1     Hi yourself stranger     0
2  say hi to Justin for me     1

As a bonus, return len(s1.intersection(s2)) to get the number of matches per line.
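
A rough sketch of that counting variant, under a few assumptions: the function name count_names, the column name Name_Count, and the third sample comment are all made up here for illustration. Because it relies on a set intersection, it counts distinct names, so a name repeated within one comment is counted once.

```python
import pandas as pd

# Sample data; the third comment is invented to show a repeated name
survey = pd.DataFrame({'Comment_Text': ['Hi rcriii',
                                        'Hi yourself stranger',
                                        'Justin and Susan say hi to Justin']})
names = pd.DataFrame({'Name': ['rcriii', 'Justin', 'Susan', 'murgatroyd']})
known_names = set(names['Name'])

def count_names(comment):
    # Number of distinct known names among the whitespace-separated words
    return len(set(comment.split()) & known_names)

# Hypothetical column name holding the per-comment count
survey['Name_Count'] = survey['Comment_Text'].apply(count_names)
name_counts = survey['Name_Count'].tolist()
print(name_counts)  # [1, 0, 2]
```

Note that split() only separates on whitespace and the comparison is exact, so 'justin' or 'Justin,' would not be counted unless the text is normalized first.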

rcriii