Classification of continuous data

Question

I have a pandas DataFrame that I use for machine learning with scikit-learn in Python. One of the columns is a target value containing continuous data (varying from -10 to +10).

From the target column, I want to calculate a new column with 5 classes where the number of rows per class is the same, i.e. if I have 1000 rows, I want to distribute them into 5 classes with roughly 200 rows in each class.

So far, I have done this in Excel, separately from my Python code, but as the data has grown this is becoming impractical.

In Excel I calculated the percentiles and then used some logic to build the classes.

How can I do this in Python?

It would be useful to see what you've started with. Have you tried any code? Can you post a small example of what you're trying to do? — alexbclay, Oct 14 '16 at 18:32
Thanks! Since I am a beginner, I gave up on the code I had. Your example worked, but when I put it into my own code I ran into problems. This is part of df['target']: 2016-08-30 3.679853 2016-08-31 4.786245 2016-09-01 3.060758 ... When I run it I get this warning: "A value is trying to be set on a copy of a slice from a DataFrame" for the line df['group'][df['target'] < quantiles[.8]] = 4. When I print(quantiles) I get the following: 0.2 NaN 0.4 NaN ... Also, all values in group are set to 5. I would think this is because of the NaN values in quantiles. — SpreadTrader, Oct 14 '16 at 22:56

David · Answer 1 · 2016-10-14T17:27:39.563

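# Create example data: 50 values uniformly distributed in [-10, 10)
import numpy as np
import pandas as pd
df = pd.DataFrame(20 * np.random.rand(50, 1) - 10, columns=['target'])

# Find the quantile boundaries
quantiles = df['target'].quantile([.2, .4, .6, .8])

# Label the groups, starting from the top group and working down.
# .loc avoids the chained assignment that triggers SettingWithCopyWarning.
df['group'] = 5
df.loc[df['target'] < quantiles[.8], 'group'] = 4
df.loc[df['target'] < quantiles[.6], 'group'] = 3
df.loc[df['target'] < quantiles[.4], 'group'] = 2
df.loc[df['target'] < quantiles[.2], 'group'] = 1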
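
The same labelling can also be done in a single vectorized step instead of four separate assignments. The following is only a minimal sketch, assuming the same four quantile edges as above and using NumPy's searchsorted; the seed and the names rng, edges and counts are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Illustrative data (the seed is an assumption, used only for reproducibility)
rng = np.random.default_rng(0)
df = pd.DataFrame({'target': rng.uniform(-10, 10, size=50)})

# The four inner quantile edges split the data into five groups
edges = df['target'].quantile([.2, .4, .6, .8]).to_numpy()

# side='right' counts how many edges are <= each value (0..4);
# adding 1 gives labels 1..5, matching the "<" comparisons above
df['group'] = np.searchsorted(edges, df['target'].to_numpy(), side='right') + 1

# Check how many rows ended up in each group
counts = df['group'].value_counts().sort_index()
print(counts)
```

If the target column contains missing values, they would need to be handled first (for example with dropna), since this sketch assumes a fully numeric column.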

score 0 · Answer 2 · answered Dec 22 '19 at 19:02

While looking for an answer to a similar question, I found this post and the following tip: What is the difference between pandas.qcut and pandas.cut?

import numpy as np
import pandas as pd

# Generate 1000 rows uniformly distributed between -10 and 10
rows = np.random.uniform(-10, 10, size=1000)

# Discretize into 5 equal-frequency classes
rows_cut = pd.qcut(rows, 5)
# Integer class labels 0..4, ordered from the lowest to the highest bin
classes = rows_cut.codes
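
To store the classes in the DataFrame from the question, qcut can be applied to the column directly. This is a minimal sketch, assuming a column named target as in the question; the generated data, the seed and the column name class are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the real data (the seed is an assumption)
rng = np.random.default_rng(42)
df = pd.DataFrame({'target': rng.uniform(-10, 10, size=1000)})

# labels=False returns integer bin indices 0..4,
# ordered from the lowest to the highest quantile bin
df['class'] = pd.qcut(df['target'], q=5, labels=False)

# Each class should hold roughly a fifth of the rows
class_counts = df['class'].value_counts().sort_index()
print(class_counts)
```

Adding 1 to the result would give classes numbered 1 to 5, as in the first answer.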
