1

Can i go from 15 object variables to one final binary target variable?

Those 15 variables has ~10.000 different codes, my dataset is about 21.000.000 records. What im trying to do is at first replace the codes i want with 1 and the other with 0, then if one of fifteen variables is 1 the target variable will be 1, if all fifteen variables are 0 the target variable will be 0.

i have tried to work with to_replace, as_type, to_numeric, infer_objects with not good results,for example my dataset look like this head(5):

    D       P1    P2   P3    P4   P5   P6   P7   P8   P9   P10   P11   P12   P13   P14  P15
    41234   1234  4367 874   NAN  NAN  NAN  789  NAN  NAN  NAN   NAN   NAN   NAN   NAN  NAN 
    42345   7657  4367 874   NAN  NAN  NAN  789  NAN  NAN  NAN   NAN   NAN   NAN   NAN  NAN
    34212   7654  4347 474   NAN  NAN  NAN  789  NAN  NAN  NAN   NAN   NAN   NAN   NAN  NAN
    34212   8902  4317 374   NAN  452  NAN  719  NAN  NAN  NAN   NAN   NAN   NAN   NAN  NAN 
    19374   2564  4387 274   NAN  452  NAN  799  NAN  NAN  NAN   NAN   NAN   NAN   NAN  NAN

I want to transform all nan as 0, and selected codes with 1, so all the P1-P15 will be binary and the i will create a final P variable with them.

For example if P1-P15 have '3578','9732','4734'...(im using about 200 codes) i want to become 1. All the other values i want to become 0. D variable should stay as it is. The final dataset will be (D,P), then i will add the train variables

Any ideas? The following code gives me wrong results.

selCodes=['3722','66']
dfnew['P']=(dfnew.loc[:,'PR1':].astype(str).isin(selCodes).any(axis=1).astype(int))

Take a look at a test dataset(left), and new P(right).With the example code 3722 P should be 1.

Test Dataset New P-3722 is in selCodes, so P should be 1

Nick.P.
  • 15
  • 6
  • Not sure if I understood correctly `df.loc[:,'P1':].fillna(0).apply(np.sign)`? – Ch3steR Jun 03 '20 at 14:39
  • What are you referring by `selected codes`? It would be great if you include the desired output you want. – Shubham Sharma Jun 03 '20 at 14:49
  • I want to replace selected P codes with 1, for example '3578','9732','4734' ....the other will replace with 0. Then i will merge them to a final P-target variable – Nick.P. Jun 03 '20 at 14:52

1 Answers1

1

IIUC, Use, DataFrame.isin:

# example select codes
selCodes = ['1234', '9732', '719']

df['P'] = (
    df.loc[:, 'P1':].astype(str)
    .isin(selCodes).any(axis=1).astype(int)
)

df = df[['D', 'P']]

Result:

       D  P
0  41234  1
1  42345  0
2  34212  0
3  34212  1
4  19374  0
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • when i run it i get "AttributeError: 'str' object has no attribute 'isin'" , codes are not numbers they are objects, '1234','9732','719'... – Nick.P. Jun 03 '20 at 15:09
  • im not sure, when i use df.dtypes, it returns me objects (all Pn (1-15)) – Nick.P. Jun 03 '20 at 15:18
  • this works better, but i need a condition on the code, my first step is to make all the selCodes =1 and all the others 0 then if a line has one 1 the target is 1 else if a line has all 0 then the target is 0... – Nick.P. Jun 03 '20 at 16:12
  • @Nick.P. Its exactly doing what you require ' make all the selCodes =1 and all the others 0 then if a line has one 1 the target is 1 else if a line has all 0 then the target is 0..' – Shubham Sharma Jun 03 '20 at 16:15
  • array([nan, 3722.0, 3722.0, ..., '5491', '9904', '8872'], dtype=object), the dataframe didnt became 0 and 1 – Nick.P. Jun 03 '20 at 16:57
  • As you can see it contains `3722.0` and not `3722`. – Shubham Sharma Jun 03 '20 at 17:04
  • Yes my friend, thanks a lot, i should see the codes again also ... im new in python. – Nick.P. Jun 03 '20 at 18:21