Separate by threshold

Question

I am trying to take the value of c_med one value as threshold from input:1 and separate the above and below values in two different outputs from input:2. Write above.csv & below.csv with reference to column c_total.

Read the above.csv as input and categorize them with percentage as mentioned in point 2 written in pure python.

Input: 1

date_count,all_hours,c_min,c_max,c_med,c_med_med,u_min,u_max,u_med,u_med_med
2,12,2309,19072,12515,13131,254,785,686,751

Input: 2 ['date','startTime','endTime','day','c_total','u_total']

2004-01-05,22:00:00,23:00:00,Mon,18944,790
2004-01-05,23:00:00,00:00:00,Mon,17534,750
2004-01-06,00:00:00,01:00:00,Tue,17262,747
2004-01-06,01:00:00,02:00:00,Tue,19072,777
2004-01-06,02:00:00,03:00:00,Tue,18275,785
2004-01-06,03:00:00,04:00:00,Tue,13589,757
2004-01-06,04:00:00,05:00:00,Tue,16053,735
2004-01-06,05:00:00,06:00:00,Tue,11440,636
2004-01-06,06:00:00,07:00:00,Tue,5972,513
2004-01-06,07:00:00,08:00:00,Tue,3424,382
2004-01-06,08:00:00,09:00:00,Tue,2696,303
2004-01-06,09:00:00,10:00:00,Tue,2350,262
2004-01-06,10:00:00,11:00:00,Tue,2309,254

I am trying to read a threshold value from another input csv c_med

I am getting following error:

Traceback (most recent call last):
  File "class_med.py", line 10, in <module>
    above_median = df_data['c_total'] > df_med['c_med']
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 735, in wrapper
    raise ValueError('Series lengths must match to compare')
ValueError: Series lengths must match to compare

filter the separated data column c_total with percentage. Pure python solution given below but I am looking for a pandas solution. like in Reference one

for row in csv.reader(inp): if int(row[1])<(.20 * max_value): val = 'viewers' elif int(row[1])>=(0.20*max_value) and int(row[1])<(0.40*max_value): val= 'event based' elif int(row[1])>=(0.40*max_value) and int(row[1])<(0.60*max_value): val= 'situational' elif int(row[1])>=(0.60*max_value) and int(row[1])<(0.80*max_value): val = 'active' else: val= 'highly active' writer.writerow([row[0],row[1],val])

Code:

import pandas as pd 
import numpy as np

df_med = pd.read_csv('stat_result.csv')
df_med.columns = ['date_count', 'all_hours', 'c_min', 'c_max', 'c_med', 'c_med_med', 'u_min', 'u_max', 'u_med', 'u_med_med']

df_data = pd.read_csv('mini_out.csv')
df_data.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']

above = df_data['c_total'] > df_med['c_med']

#print above_median

above.to_csv('above.csv', index=None, header=None)

df_above = pd.readcsv('above_median.csv')
df_above.columns = ['date', 'startTime', 'endTime', 'day', 'c_total', 'u_total']

#Percentage block should come here

Edit: In case of single column value the qcut is the simplest solution. But when it comes to using two values from two different columns how to achieve that in pandas ?

for row in csv.reader(inp):
        if int(row[1])>(0.80*max_user) and int(row[2])>(0.80*max_key):
            val='highly active'
        elif int(row[1])>=(0.60*max_user) and int(row[2])<=(0.60*max_key):
            val='active'
        elif int(row[1])<=(0.40*max_user) and int(row[2])>=(0.40*max_key):  
            val='event based'
        elif int(row[1])<(0.20*max_user) and int(row[2])<(0.20*max_key):
            val ='situational'
        else:
            val= 'viewers'

score 1 · Accepted Answer · answered Aug 14 '16 at 09:44

assuming you have the following DFs:

In [7]: df1
Out[7]:
   date_count  all_hours  c_min  c_max  c_med  c_med_med  u_min  u_max  u_med  u_med_med
0           2         12   2309  19072  12515      13131    254    785    686        751

In [8]: df2
Out[8]:
          date startTime   endTime  day  c_total  u_total
0   2004-01-05  22:00:00  23:00:00  Mon    18944      790
1   2004-01-05  23:00:00  00:00:00  Mon    17534      750
2   2004-01-06  00:00:00  01:00:00  Tue    17262      747
3   2004-01-06  01:00:00  02:00:00  Tue    19072      777
4   2004-01-06  02:00:00  03:00:00  Tue    18275      785
5   2004-01-06  03:00:00  04:00:00  Tue    13589      757
6   2004-01-06  04:00:00  05:00:00  Tue    16053      735
7   2004-01-06  05:00:00  06:00:00  Tue    11440      636
8   2004-01-06  06:00:00  07:00:00  Tue     5972      513
9   2004-01-06  07:00:00  08:00:00  Tue     3424      382
10  2004-01-06  08:00:00  09:00:00  Tue     2696      303
11  2004-01-06  09:00:00  10:00:00  Tue     2350      262
12  2004-01-06  10:00:00  11:00:00  Tue     2309      254

separate by threshold (you can compare two series with the same length or with a scalar value - i assume you will to separate your second data set, comparing it to the scalar value (c_med column) from the first of your first data set:

In [22]: above = df2[df2.c_total > df1.ix[0, 'c_med']]

In [23]: above
Out[23]:
         date startTime   endTime  day  c_total  u_total
0  2004-01-05  22:00:00  23:00:00  Mon    18944      790
1  2004-01-05  23:00:00  00:00:00  Mon    17534      750
2  2004-01-06  00:00:00  01:00:00  Tue    17262      747
3  2004-01-06  01:00:00  02:00:00  Tue    19072      777
4  2004-01-06  02:00:00  03:00:00  Tue    18275      785
5  2004-01-06  03:00:00  04:00:00  Tue    13589      757
6  2004-01-06  04:00:00  05:00:00  Tue    16053      735

you can use qcut() method in order to categorize your data:

In [29]: df2['cat'] = pd.qcut(df2.c_total,
   ....:                        q=[0, .2, .4, .6, .8, 1.],
   ....:                        labels=['viewers','event based','situational','active','highly active'])

In [30]: df2
Out[30]:
          date startTime   endTime  day  c_total  u_total            cat
0   2004-01-05  22:00:00  23:00:00  Mon    18944      790  highly active
1   2004-01-05  23:00:00  00:00:00  Mon    17534      750         active
2   2004-01-06  00:00:00  01:00:00  Tue    17262      747         active
3   2004-01-06  01:00:00  02:00:00  Tue    19072      777  highly active
4   2004-01-06  02:00:00  03:00:00  Tue    18275      785  highly active
5   2004-01-06  03:00:00  04:00:00  Tue    13589      757    situational
6   2004-01-06  04:00:00  05:00:00  Tue    16053      735    situational
7   2004-01-06  05:00:00  06:00:00  Tue    11440      636    situational
8   2004-01-06  06:00:00  07:00:00  Tue     5972      513    event based
9   2004-01-06  07:00:00  08:00:00  Tue     3424      382    event based
10  2004-01-06  08:00:00  09:00:00  Tue     2696      303        viewers
11  2004-01-06  09:00:00  10:00:00  Tue     2350      262        viewers
12  2004-01-06  10:00:00  11:00:00  Tue     2309      254        viewers

check:

In [32]: df2.assign(pct=df2.c_total/df2.c_total.max())[['c_total','pct','cat']]
Out[32]:
    c_total       pct            cat
0     18944  0.993289  highly active
1     17534  0.919358         active
2     17262  0.905096         active
3     19072  1.000000  highly active
4     18275  0.958211  highly active
5     13589  0.712510    situational
6     16053  0.841705    situational
7     11440  0.599832    situational
8      5972  0.313129    event based
9      3424  0.179530    event based
10     2696  0.141359        viewers
11     2350  0.123217        viewers
12     2309  0.121068        viewers

Thank you for the best and simple solution. May I also request `qcut` for two values ? Please check the edit section of the question. Thanks again ! — Sitz Blogz, Aug 17 '16 at 06:22
Error: `Traceback (most recent call last): File "class_med.py", line 13, in df2['cat'] = pd.qcut(df2.c_total, q=[0, .4, .7, 1.], labels=['not populated', 'fairly populated', 'populated', 'highly populated']) File "/usr/local/lib/python2.7/dist-packages/pandas/tools/tile.py", line 173, in qcut precision=precision, include_lowest=True) File "/usr/local/lib/python2.7/dist-packages/pandas/tools/tile.py", line 217, in _bins_to_cuts raise ValueError('Bin labels must be one fewer than ' ValueError: Bin labels must be one fewer than the number of bin edges` — Sitz Blogz, Aug 18 '16 at 21:41
@SitzBlogz, you have four labels for three intervals: `[0, .4), [.4, .7), [.7, 1.]`. The error message: `Bin labels must be one fewer than the number of bin edges` should be self-explanatory... ;) — MaxU - stand with Ukraine, Aug 19 '16 at 07:27
I guess i have understood the mistake will make changes and check again. May i request for the edit section solution too? Please! — Sitz Blogz, Aug 19 '16 at 08:23
The suggestion worked great and now proceeding to the pairs of values for the same categorization. http://stackoverflow.com/questions/39044434/use-qcut-pandas-for-multiple-valuable-categorizing — Sitz Blogz, Aug 19 '16 at 17:09

Separate by threshold

1 Answers1