2

I've been locked out for days trying to create a new categorical variable from the aggregation of values ​​from a sequential variable. I'm migrating syntax scripts from SPSS to Python using Pandas In SPSS I used the following command to do what I need:

RECODE QTD_FTE_RAZAO (0=5) (1 thru 3000 = 0) (3000.01 thru 4000 = 1) (4000.01 thru 5000 = 2) (5000.01 thru 10000 = 3) (10000.01 thru 15000 = 4) (15000.01 thru 9999999999 = 5)
INTO COD_FTE_RAZAO.
EXECUTE.

I looked at the pandas cut function, but it doesn't work because it creates text objects and I honestly couldn't write the parameters I need.

I need to turn it into values ​​because I will create an index. I will create 5 variable categories and in the end I will add the values ​​and create an index. That's why I need to turn these ranges into numbers.

The easiest way to do this would be to do a function?

I'm completely lost, could someone help me please?

Heavy Hammer
  • 291
  • 3
  • 11
  • 1
    how come `pd.cut` doesn't work? you might want to look at [`np.select`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.select.html), too. Also, please see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – help-ukraine-now Aug 09 '19 at 21:07
  • Because I need to put the values that I need. For exemple if 10 until 20 = 5 if 50 until 100 = 1. Thank you for the advice – Heavy Hammer Aug 09 '19 at 21:13
  • couldn't you use both `bins` and `labels` parameters? you first specify bin values, then - labels to returned bins. `pd.cut(df[col], bins=[1, 5, 10], labels=[3, 18])` (you could assign any label you want for an interval range) – help-ukraine-now Aug 09 '19 at 21:20

1 Answers1

2

Using hand-generated data, pd.cut with explicit bins turns val into bucket categories.

df = pd.DataFrame([{'val': 5}, {'val': 500},{'val': 5000},{'val': 2000},{'val': 3500},{'val': 6000},{'val': 12000},{'val': 500000}])
df['vars'] = pd.cut(df['val'],bins=[1, 3000, 4000, 5000, 10000, 15000, float('inf')])
df

The result is the first two columns of the table below. If you want to turn them into integers of some sort, you can map the categories in index order with df['codes'] = df.vars.cat.codes.

      val                vars  codes
0       5       (1.0, 3000.0]      0
1     500       (1.0, 3000.0]      0
2    5000    (4000.0, 5000.0]      2
3    2000       (1.0, 3000.0]      0
4    3500    (3000.0, 4000.0]      1
5    6000   (5000.0, 10000.0]      3
6   12000  (10000.0, 15000.0]      4
7  500000      (15000.0, inf]      5

As user political scientist commented, if you prefer not having an intermediate column and want whatever labels you desire that are not necessarily the integer indices of those bins, pass labels=[0, 1, 2, 3, 4, 5] (or a list containing the appropriate labels) to pd.cut.

ifly6
  • 5,003
  • 2
  • 24
  • 47
  • Thank you, I did this: index['cod_fte_razao'] = pd.cut(index['qtd_fte_razao'],bins=[1, 3000, 4000, 5000, 10000, 15000, float('inf')], labels = [5,0,1,2,3,4,5]) print(index.head()) and I got this error: ValueError: Bin labels must be one fewer than the number of bin edges I guess that is because I used the same code two times. – Heavy Hammer Aug 09 '19 at 23:17
  • 1
    It's because you provided the wrong number of labels to bins. – ifly6 Aug 10 '19 at 00:10
  • Thank you, yes, you were right, but now i've got this problem. **Categorical categories must be unique**. Actually, for the calculation of my index to work, I need to put the first category and the last one with the same value. Is there another solution? – Heavy Hammer Aug 11 '19 at 21:34
  • 1
    First, think about what your'e doing and whether you really need to have the same value for the first and last categories, because (1) your original code doesn't have that and (2) I can't think of any reason to do that. Solution however, is to just replace the values after the categorisation. – ifly6 Aug 12 '19 at 02:18