
I have a pandas DataFrame that I use for machine learning with scikit-learn in Python. One of the columns is a target value containing continuous data (ranging from -10 to +10).

From the target column, I want to calculate a new column with 5 classes where the number of rows per class is the same, i.e. if I have 1000 rows, I want to distribute them into 5 classes with roughly 200 rows in each.

So far, I have done this in Excel, separate from my Python code, but as the data has grown this has become impractical.

In Excel I have calculated the percentiles and then used some logic to build the classes.

How can I do this in Python?

Jeff B
  • It would be useful to see what you've started with. Have you tried any code? Can you post a small example of what you're trying to do? – alexbclay Oct 14 '16 at 18:32
  • Thanks! Since I am a beginner I gave up on the code I had. Your example worked but when I put into my code I got problems. This is part of df('target'): 2016-08-30 3.679853 2016-08-31 4.786245 2016-09-01 3.060758 ... When I run I got this warning: A value is trying to be set on a copy of a slice from a DataFrame df['group'][df['target'] < quantiles[.8]] = 4 When I print(quantiles) I get the following: 0.2 NaN 0.4 NaN ... Also all values in group are set to '5'. I would think this is because of the NaN in quantile. – SpreadTrader Oct 14 '16 at 22:56

2 Answers

#create data
import numpy as np
import pandas as pd
df = pd.DataFrame(20*np.random.rand(50, 1)-10, columns=['target'])

#find quantiles
quantiles = df['target'].quantile([.2, .4, .6, .8])
#labeling of groups; .loc avoids chained assignment (SettingWithCopyWarning)
df['group'] = 5
df.loc[df['target'] < quantiles[.8], 'group'] = 4
df.loc[df['target'] < quantiles[.6], 'group'] = 3
df.loc[df['target'] < quantiles[.4], 'group'] = 2
df.loc[df['target'] < quantiles[.2], 'group'] = 1
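The same labeling can probably be done in one step rather than one assignment per class. Below is a minimal sketch, assuming the quantile thresholds are used as bin edges for np.searchsorted; the seed, the 50-row sample, and the name group_sizes are only illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Illustrative data: 50 continuous values between -10 and 10 (arbitrary seed)
rng = np.random.default_rng(0)
df = pd.DataFrame({'target': rng.uniform(-10, 10, size=50)})

# Same thresholds as above: the 20th, 40th, 60th and 80th percentiles
quantiles = df['target'].quantile([.2, .4, .6, .8])

# searchsorted counts how many thresholds are <= each value (side='right'),
# so values below the first threshold get 0 and values at or above the last get 4;
# adding 1 gives classes 1..5
df['group'] = np.searchsorted(quantiles.to_numpy(),
                              df['target'].to_numpy(),
                              side='right') + 1

# Number of rows per class, ordered by class label
group_sizes = df['group'].value_counts().sort_index()
print(group_sizes)
```

With distinct target values this should give the same classes as the step-by-step assignments, since a value equal to a threshold falls into the higher class in both versions.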
David

While looking for an answer to a similar question, I found this post and the following tip: What is the difference between pandas.qcut and pandas.cut?

import numpy as np
import pandas as pd

#generate 1000 rows of uniform distribution between -10 and 10
rows = np.random.uniform(-10, 10, size = 1000)

#generate the discretization in 5 classes
rows_cut = pd.qcut(rows, 5)
#.codes numbers the classes 0..4 in bin order (factorize() would number them by first appearance)
classes = rows_cut.codes
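To tie this back to the question, here is a small sketch of applying pd.qcut directly to a DataFrame column and checking the class sizes. The column names 'target' and 'group' follow the question and the other answer; the seed and the name class_sizes are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data: 1000 uniform values between -10 and 10 (arbitrary seed)
rng = np.random.default_rng(0)
df = pd.DataFrame({'target': rng.uniform(-10, 10, size=1000)})

# labels=False makes qcut return the integer bin index (0..4) for each row;
# adding 1 gives classes 1..5, ordered from lowest to highest target
df['group'] = pd.qcut(df['target'], q=5, labels=False) + 1

# Number of rows per class, ordered by class label
class_sizes = df['group'].value_counts().sort_index()
print(class_sizes)
```

Note that rows with a NaN target are not assigned to any bin and end up with NaN in the new column, so it may be worth checking for missing values first if the quantiles look wrong.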
Waliston