0

I have a dataset similar to the following.

date,score
3/1/16,0.6369
5/1/16,-0.2023
6/1/16,0.04
7/1/16,0.0772
9/1/16,-0.4215
12/1/16,0.2960
15/1/16,0.25
15/1/16,0.7684

I want to apply the following conditions on the score.

Con1: if the score is >.05, count that as positive for that date
Con2: if the score is  -0.05<=score <=.05, count that as neutral for that date
Con3: Else, count that as negative for that date
And add a new_column to the DataFrame alongside the score to put the 'negative'/'positive'/'neutral' result

Expected Output:

date, score, mood
3/1/16,0.6369, positive
5/1/16,-.2023, negative
6/1/16,0.04, neutral

And I have multiple scores on the same date. So, I thought of using groupby with multiple columns ('date'and 'score') and pass through the if conditions and add a new column ['mood'] to the DataFrame.

What I have tried:

df =pd.read_csv('file.csv')
def SortMood(df)
df['mood']=[] #empty column as a list in the df to store the mood 
 for score in df['score']:
      if score>(0.05):
            df['mood'].append('positive')
      elif -0.05<=score <=.05:
            df['mood'].append('neutral')
      else:
          df['mood'].append('negative')

I am aware that this function is wrong (I get a ValueError). So, any help is appreciated. Thank you.

non_linear
  • 413
  • 7
  • 15
  • What is your expected output when different scores for the same date give different answers? – andrew_reece Feb 16 '20 at 01:33
  • @andrew_reece it is according to the if/else statements. Actually, I have a lot of different scores for the same date. Thanks – non_linear Feb 16 '20 at 01:36
  • Does this answer your question? [Add new column to Python Pandas DataFrame based on multiple conditions](https://stackoverflow.com/questions/49586471/add-new-column-to-python-pandas-dataframe-based-on-multiple-conditions) – oreopot Feb 16 '20 at 01:37
  • 1
    @Non_linear what is unclear is how the if/else statements should apply in the case where you have multiple evaluations for the same date. and are you expecting only one output per date? if so, that will require a `groupby`. It'd be helpful if you could (a) specify your complete expected output and (b) include an edge case with two different results for the same date. (Currently 15/1/16 has two entries but they both evaluate to positive.) – andrew_reece Feb 16 '20 at 01:43
  • @andrew_reece I should have probably put that. But, your suggested solution takes care of this issue as well. I just checked the output result on a single date with 3 different cases (positive, negative and neutral) with different scores and it seems to classify all of them correctly. Thanks. – non_linear Feb 16 '20 at 01:48

3 Answers3

2

Use pd.cut to bin your data to categorical:

df['mood'] = pd.cut(df['score'], 
                    bins=[-np.inf, -.05, .05, np.inf], 
                    labels=['negative', 'neutral', 'positive'])

      date   score      mood
0   3/1/16  0.6369  positive
1   5/1/16 -0.2023  negative
2   6/1/16  0.0400   neutral
3   7/1/16  0.0772  positive
4   9/1/16 -0.4215  negative
5  12/1/16  0.2960  positive
6  15/1/16  0.2500  positive
7  15/1/16  0.7684  positive

Or use numpy.select for vectorized multiple conditional column:

conditions = [
    df['score'].lt(-.05),
    df['score'].between(-.05, 0.05)
]

df['mood'] = np.select(conditions, ['negative', 'neutral'], default='positive')

      date   score      mood
0   3/1/16  0.6369  positive
1   5/1/16 -0.2023  negative
2   6/1/16  0.0400   neutral
3   7/1/16  0.0772  positive
4   9/1/16 -0.4215  negative
5  12/1/16  0.2960  positive
6  15/1/16  0.2500  positive
7  15/1/16  0.7684  positive
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • 1
    This is almost certainly a faster solution than using `apply`. To OP, if efficiency is important, I would suggest going with @Erfan's answer. – andrew_reece Feb 16 '20 at 01:44
  • 1
    I always forget about `cut`, great solution +1 – LeoE Feb 16 '20 at 01:46
  • 1
    Yes. This is definitely much faster than andrew_reece's solution. And works very well for my humongous dataset (millions of rows). I will change it. Thanks. – non_linear Feb 16 '20 at 01:54
1

Use apply:

def determine_mood(row, thresh=.05):
    if row.score < -thresh:
        return "negative"
    elif row.score > thresh:
        return "positive"
    else:
        return "neutral"

df["mood"] = df.apply(determine_mood, axis=1)

df
      date   score      mood
0   3/1/16  0.6369  positive
1   5/1/16 -0.2023  negative
2   6/1/16  0.0400   neutral
3   7/1/16  0.0772  positive
4   9/1/16 -0.4215  negative
5  12/1/16  0.2960  positive
6  15/1/16  0.2500  positive
7  15/1/16  0.7684  positive
andrew_reece
  • 20,390
  • 3
  • 33
  • 58
1

There are multiple ways to do it. Very similar to your approach would be the apply method of pandas:

def get_mood(row):
    if row['score'] > 0.05:
        return 'positive'
    elif row['score'] > -0.05:
        return 'neutral'
    else:
        return 'negative'
df['mood'] = df.apply(get_mood, axis=1)

Or shorten it down with lambda expression to:

df['mood'] = df.apply(lambda x: 'positive' if x > 0.05 else ('neutral' if x > -0.05 else 'negative'), axis=1)

Or use map:

df.loc[:,'mood'] = df['score'].map(lambda x: 'positive' if x > 0.05 else ('neutral' if x > -0.05 else 'negative'))

I think this should be a lot faster, than apply

LeoE
  • 2,054
  • 1
  • 11
  • 29