
I have a dataframe of "sentences", from which I wish to search for a keyword. Let's say that my keyword is just the letter 'A'. Sample data:

year | sentence | index
-----------------------
2015 | AAX      | 0
2015 | BAX      | 1
2015 | XXY      | -1
2016 | AWY      | 0
2017 | BWY      | -1

That is, the "index" column shows the index of the first occurrence of "A" in each sentence (-1 if not found). I want to group the rows by year, with a column showing the fraction of records in each year that contain 'A'. That is:

year | index
-------------
2015 | 0.667
2016 | 1.0
2017 | 0
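For anyone wanting to reproduce this, a minimal sketch of the sample data follows. It assumes the "index" column was produced by a plain substring search (`Series.str.find`), which the question does not state explicitly.

```python
import pandas as pd

# Sample data from the question; the "index" column is recomputed here
# rather than typed in (assumption: it came from a plain substring search).
df = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2017],
    "sentence": ["AAX", "BAX", "XXY", "AWY", "BWY"],
})
df["index"] = df["sentence"].str.find("A")  # -1 where 'A' is absent
print(df)
```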

I have a feeling that this involves agg or groupby in some fashion, but I'm not clear how to string these together. I've gotten as far as:

df.groupby("index").count()

The difficulty is that this needs some kind of conditional count() first: count the number of rows in year 201X that contain 'A', then divide that by the total number of rows in year 201X.

AndreyIto

4 Answers


You can use value_counts or GroupBy.size with boolean indexing:

See also: What is the difference between size and count in pandas?

df2 = df['year'].value_counts()
print (df2)
2015    3
2017    1
2016    1
Name: year, dtype: int64

df1 = df.loc[df['index'] != -1, 'year'].value_counts()
print (df1)
2015    2
2016    1
Name: year, dtype: int64

Or:

df2 = df.groupby('year').size()
print (df2)
year
2015    3
2016    1
2017    1
dtype: int64

df1 = df.loc[df['index'] != -1, ['year']].groupby('year').size()
print (df1)
year
2015    2
2016    1
dtype: int64

Finally, divide the two with div:

print (df1.div(df2, fill_value=0))
2015    0.666667
2016    1.000000
2017    0.000000
Name: year, dtype: float64
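To see why `fill_value=0` matters in the last step, here is a small self-contained sketch using the question's sample values. The variable names `total`, `hits`, `without_fill`, and `with_fill` are illustrative only and do not appear in the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2017],
    "index": [0, 1, -1, 0, -1],
})

total = df.groupby("year").size()                        # rows per year
hits = df.loc[df["index"] != -1].groupby("year").size()  # rows containing 'A'

# 2017 has no hits, so it is missing from `hits`; plain division gives NaN.
without_fill = hits.div(total)
# fill_value=0 substitutes 0 for the missing numerator before dividing.
with_fill = hits.div(total, fill_value=0)

print(without_fill)
print(with_fill)
```

Without the fill, years that never contain the keyword would show up as NaN rather than 0.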
jezrael

There are different ways to do it, but no 'native' way as far as I know. Here's one example, with only one groupby:

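# Named aggregation avoids relying on the auto-generated '<lambda>' column label,
# which differs between pandas versions.
g = df.groupby('year')['index'].agg(found=lambda x: (x >= 0).sum(), total='count')
g['found'] / g['total']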

Check also:

Alex
from __future__ import division  # only needed on Python 2
import pandas as pd
x_df = df  # your dataframe

# fraction of sentences per year that contain 'A'
y = x_df.groupby('year')['sentence'].apply(lambda s: sum('A' in i for i in s) / len(s))

# or, using the precomputed index column

y = x_df.groupby('year')['index'].apply(lambda s: sum(i >= 0 for i in s) / len(s))

Using the sentence column to check directly:

df.sentence.str.contains('A').groupby(df.year).mean()

year
2015    0.666667
2016    1.000000
2017    0.000000
Name: sentence, dtype: float64

Using the index column, which already holds the result of the check:

df['index'].ne(-1).groupby(df.year).mean()

year
2015    0.666667
2016    1.000000
2017    0.000000
Name: index, dtype: float64
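Both one-liners rely on the mean of a boolean Series being the fraction of True values. A minimal sketch spelling that out on the question's sample data (the names `has_a`, `share`, and `explicit` are illustrative, not from the answer):

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2017],
    "sentence": ["AAX", "BAX", "XXY", "AWY", "BWY"],
})

has_a = df["sentence"].str.contains("A")   # boolean Series, one flag per row
share = has_a.groupby(df["year"]).mean()   # mean of booleans = fraction of True

# Same result written out as an explicit count divided by the group size.
by_year = has_a.groupby(df["year"])
explicit = by_year.sum() / by_year.size()
print(share)
```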
piRSquared