1

I am trying to identify if a column Blaze[Info] contains within the text a string from a list (and create a new Boolean column with that information).

The DataFrame looks like:

       Word          Info
0      Aam           Aam, n. Etym: [D. aam, fr. LL. ama; cf. L. ham...
1      aard-vark     Aard"-vark`, n. Etym: [D., earth-pig.] (Zoöl.)
2      aard-wolf     Aard"-wolf`, n. Etym: [D, earth-wolf] (Zoöl.)

When I state the term directly I get the answer I want:

Blaze['Noun'] = np.where((Blaze['Info'].str.contains('n.')),True,False) Blaze['Verb'] = np.where((Blaze['Info'].str.contains('v.')),True,False)

       Word          Info                                                Noun   Verb
0      Aam           Aam, n. Etym: [D. aam, fr. LL. ama; cf. L. ham...   True   False
1      aard-vark     Aard"-vark`, n. Etym: [D., earth-pig.] (Zoöl.)      True   False
2      aard-wolf     Aard"-wolf`, n. Etym: [D, earth-wolf] (Zoöl.)       True   False

but this is not scalable as I have 100+ features to search for.

When I iterate through the list abbreviation :

abbreviation=['n'., 'v.']
col_name=['Noun','Verb']

for i in range(len(abbreviation)):
    Blaze[col_name[i]] = np.where((Blaze['Info'].str.contains(abbreviation[i])), True, False)

I am returned DataFrame full of 'FALSE' entries:

       Word          Info                                                Noun   Verb
0      Aam           Aam, n. Etym: [D. aam, fr. LL. ama; cf. L. ham...   False  False
1      aard-vark     Aard"-vark`, n. Etym: [D., earth-pig.] (Zoöl.)      False  False
2      aard-wolf     Aard"-wolf`, n. Etym: [D, earth-wolf] (Zoöl.)       False  False

I can see several answers for doing something similar but grouping the answer in a single row: Check if each row in a pandas series contains a string from a list using apply?

Scalable solution for str.contains with list of strings in pandas

but I don't think these solve the above.

Is anyone able to explain how I am going wrong?

Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
Alex
  • 87
  • 8
  • Your code works for me -- the only thing you need to do is pass `regex=False` a s`.` is a regex character meaning the second row would be `True` since there is a `z` in it. There is nothing wrong with the way you are doing the loop, but I think my way may be slightly more pythonic. – David Erickson Dec 10 '20 at 23:18

1 Answers1

1

You can loop through the lists simultaneously with zip. Make sure to pass regex=False to str.contains as . is a regex character.

abbreviation=['n.', 'v.']
col_name=['Noun','Verb']
for a, col in zip(abbreviation, col_name):
    Blaze[col] = np.where(Blaze['Info'].str.contains(a, regex=False),True,False)
Blaze
Out[1]: 
        Word                                               Info  Noun   Verb
0        Aam  Aam, n. Etym: [D. aam, fr. LL. ama; cf. L. ham...  True  False
1  aard-vark     Aard"-vark`, n. Etym: [D., earth-pig.] (Zoöl.)  True  False
2  aard-wolf      Aard"-wolf`, n. Etym: [D, earth-wolf] (Zoöl.)  True  False

If required, str.contains also has a case parameter, so you can specify case=False to search case-insensitively.

David Erickson
  • 16,433
  • 2
  • 19
  • 35