I'm really struggling with encoding categorical features. Given two DataFrames, X_train and X_test, I'm trying to encode all the zip codes. The main obstacle is encoding values from both DataFrames in the same way, since their sets of zip codes differ to some extent. I thought of building a list of all possible zip code values and using it to encode both series (as columns of the DataFrames). Unfortunately, this fails with AttributeError: 'numpy.ndarray' object has no attribute 'transform'. I'm running out of ideas.

X_train = X[['ticket_id','judgment_amount','zip_code']]
X_test = y[['ticket_id','judgment_amount','zip_code']]


Xtrain_zipcode =  X_train['zip_code'].dropna().unique().tolist()
Xtest_zipcode = X_test['zip_code'].dropna().unique().tolist()

zip_list = Xtrain_zipcode

for elem in Xtest_zipcode:
    if elem not in Xtrain_zipcode:
        zip_list.append(elem)

enc_zipc = LabelEncoder().fit(zip_list)
encoded = enc_zipc.transform(zip_list)
encoded.transform(X_train['zip_code'])

I have also read that LabelEncoder is not advisable for categorical input features. What would you suggest? One-hot encoding?

thesecond

1 Answer

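Have you tried concatenating train and test and then fitting the encoder on the concatenated zip codes? The AttributeError occurs because enc_zipc.transform(zip_list) returns a NumPy array, which has no transform method; transform has to be called on the fitted encoder itself.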

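import pandas as pd
from sklearn.preprocessing import LabelEncoder

# .copy() avoids SettingWithCopyWarning when the columns are overwritten below
X_train = X[['ticket_id','judgment_amount','zip_code']].copy()
X_test = y[['ticket_id','judgment_amount','zip_code']].copy()
# cast to str so all values, including missing ones, share one comparable type
X_train['zip_code'] = X_train['zip_code'].astype(str)
X_test['zip_code'] = X_test['zip_code'].astype(str)

# stack both frames so the encoder sees every zip code
concat = pd.concat([X_train, X_test], axis=0, ignore_index=True)

enc_zipc = LabelEncoder()
enc_zipc.fit(concat['zip_code'])

# call transform on the fitted encoder, not on its output
X_train['zip_code'] = enc_zipc.transform(X_train['zip_code'])
X_test['zip_code'] = enc_zipc.transform(X_test['zip_code'])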
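
On the one-hot part of the question: a possible alternative, sketched here on made-up toy data rather than the real X_train and X_test, is to fit scikit-learn's OneHotEncoder on the training column only and let it ignore categories it has not seen. The zip code values below are assumptions for illustration, and missing values are not handled in this sketch.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the real frames (hypothetical values).
X_train = pd.DataFrame({"zip_code": ["48201", "48202", "48201"]})
X_test = pd.DataFrame({"zip_code": ["48202", "48999"]})  # "48999" is absent from training

# Fit on the training column only. With handle_unknown="ignore",
# a category unseen during fit is encoded as an all-zero row.
enc = OneHotEncoder(handle_unknown="ignore")
train_onehot = enc.fit_transform(X_train[["zip_code"]]).toarray()
test_onehot = enc.transform(X_test[["zip_code"]]).toarray()

print(enc.categories_[0])  # categories learned from the training data
print(test_onehot)         # second row is all zeros for the unseen zip code
```

Compared with fitting on the concatenated frames, this keeps the test set out of the fitting step, at the cost of one column per distinct training zip code.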
Marco Cerliani