Suppose I have a DataFrame, in which one of the columns (we'll call it 'power') holds integer values from 1 to 10000. I would like to produce a numpy array which has, for each row, a value indicating whether the corresponding row of the DataFrame has a value in the 'power' column which is greater than 9000.

I could do something like this:

import numpy as np

def categorize(frame):
    return np.array(frame['power'] > 9000)

This gives me a boolean array whose elements can be tested against True and False. However, suppose I want the array to contain 1 and -1 instead of True and False. How can I do this without iterating over each row in the frame?

For background, the application is preparing data for binary classification via machine learning with scikit-learn.

PTTHomps

1 Answer

You can use np.where for this: it picks one value where a condition is True and another where it is False, without an explicit loop.

Consider the following:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(20)})
df['even'] = df.a % 2 == 0  # True where 'a' is even

Now even is a boolean column. To build the array you want, use:

np.where(df.even, 1, -1)

You can assign this back to the DataFrame, if you like:

df['foo'] = np.where(df.even, 1, -1)
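Applied to the 'power' column from the question, the same idea might look like the sketch below. The sample DataFrame and the threshold parameter are made up for illustration; only the 'power' column name and the 9000 cutoff come from the question.

```python
import numpy as np
import pandas as pd

def categorize(frame, threshold=9000):
    """Return an array with 1 where 'power' exceeds threshold, -1 elsewhere."""
    # np.where evaluates the condition element-wise, so no Python loop is needed.
    return np.where(frame['power'] > threshold, 1, -1)

# Hypothetical sample data, not taken from the question.
sample = pd.DataFrame({'power': [1, 9000, 9001, 10000]})
labels = categorize(sample)
```

The result is a plain numpy integer array, which should be usable directly as a label vector for a binary classifier.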

See the pandas cookbook for more examples of this kind.

Ami Tavory