3

I have a pandas df with a column A, which is a string of strings. Each item in the series (i.e. each row in the database) is just one long string, separated by commas. I would like to create a new column called B that incrementally counts everytime an object from a separate list appears in each row of column A. For instance:

my list looks like this:

list = ('dog', 'bird', 'cat')

my dataframe looks like this:

A                           B
dog, bird                   2
cat, bird                   2
dog, snake                  1
cat, bird, snake            2
dog, bird, cat, snake       3
dog, bird cat               3

i'm trying to create a nested loop that does the following: start at df.a[0] (i.e. the first value of df.A), find out if it contains the first value of the list (i.e. 'dog'). If df.A[0] contains, then add 1 to B. then, staying in the same row of df.A, move on to the second value of the list (i.e. 'bird'). If that df.A[0] also contains this value, then add another 1 to B. etc etc.

This is the code I'm trying to use.

for i in df['A']:
    for j in list:
        if i.str.contains(j):
            df['B'] += 1

However, I keep getting the error:

'str' object has no attribute 'str'

How can I tell pandas to look at the entire series, while also telling it to function as a loop with the structure identified above? Or alternatively, what is the best way to solve this problem?

cs95
  • 379,657
  • 97
  • 704
  • 746
sne89
  • 31
  • 1
  • 5
  • 1
    `i` will be string since you are getting it from `A`. Try `i.split(',')` to get a list out of it. Also on a side note, variable name `list` is a bad idea. – harvpan May 08 '18 at 20:15

2 Answers2

6

A couple of notes -

  1. Don't iterate over a DataFrame when you can avoid it. Always look to vectorize. If you cannot, only then use a list comprehension
  2. When iterating over the column, you're iterating over individual string items. Those do not have an .str attribute.
  3. Don't use list and other similar names (dict, tuple) to name variables/objects, they shadow the builtin. I've renamed your variable to substr below.

A KISS solution would involve str.findall + str.len. No splitting required.

substr = ('dog', 'bird', 'cat')
df['B'] = df['A'].str.findall('|'.join(substr)).str.len()

df['B']

0    2
1    2
2    1
3    2
4    3
5    3
Name: A, dtype: int64

If you have large strings and a lot of substrings, you may want to look at using the Aho-Corasick algorithm.

cs95
  • 379,657
  • 97
  • 704
  • 746
4

Another way to achieve what you need involved get_dummies

df.A.str.get_dummies(', ').loc[:,lst].sum(1)
Out[849]: 
0    2
1    2
2    1
3    2
4    3
5    3
dtype: int64

Or we just do str.split with isin

df.A.str.split(', ',expand=True).isin(lst).sum(1)
Out[853]: 
0    2
1    2
2    1
3    2
4    3
5    3
dtype: int64
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Hmm, what is `lst` for you? This isn't giving me the same output. Also, in the last row, bird and cat do not have a comma separating them. So will this still work? – cs95 May 08 '18 at 20:23
  • @cᴏʟᴅsᴘᴇᴇᴅ lst = ['dog', 'bird', 'cat'] – BENY May 08 '18 at 20:24
  • I get `Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative.`. Any idea? I think it's because there's a column called "bird cat" that gets generated which is causing problems. – cs95 May 08 '18 at 20:25
  • For this to work, the arg should have whitespace `get_dummies(', ')` I believe. Last row most likely a typo? – rafaelc May 08 '18 at 20:25
  • @RafaelC yep, I think so , also , if the last lane is not type , it should not count in my result – BENY May 08 '18 at 20:26
  • 1
    If `list` might contain elements not contained in any row, then you may want to intersect `list` with `df.A.str.get_dummies(', ').columns` before using `loc` – BallpointBen May 08 '18 at 21:08