I'm using LabelEncoder and OneHotEncoder from sklearn in a machine learning project to encode the labels (country names) in the dataset. Everything works well and my model runs without problems. The project is to classify whether a bank customer will stay with or leave the bank based on a number of features (data), including the customer's country.
My issue arises when I want to predict (classify) a single new customer. The data for the new customer has not been pre-processed yet (i.e., the country name is not encoded). Something like the following:
new_customer = np.array([['France', 600, 'Male', 40, 3, 60000, 2, 1, 1, 50000]])
In the online course where I am learning machine learning, the instructor opened the pre-processed dataset that included the encoded data, manually looked up the code for France, and updated it in new_customer as follows:
new_customer = np.array([[0, 0, 600, 'Male', 40, 3, 60000, 2, 1, 1, 50000]])
I believe that this is not practical. There must be a way to automatically encode France to the same code used in the original dataset, or at least a way to return a list of the countries and their encoded values. Manually encoding a label seems tedious and error-prone. How can I automate this process, or generate the codes for the labels? Thanks in advance.
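For reference, here is a minimal sketch of the kind of automation being asked about. It assumes the LabelEncoder that was fitted on the training data is still available. The names countries, country_encoder, code_by_country and france_code are hypothetical, and the sample column is made up for illustration, not taken from the real dataset. A fitted LabelEncoder keeps its sorted labels in classes_, and calling transform on it reuses the same codes:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the country column of the training data.
countries = np.array(['France', 'Spain', 'Germany', 'France', 'Spain'])

# Fit once on the training column and keep the fitted encoder around.
country_encoder = LabelEncoder()
country_encoder.fit(countries)

# classes_ holds the labels in sorted order; the index of each label is its code.
code_by_country = {
    str(label): code for code, label in enumerate(country_encoder.classes_)
}
print(code_by_country)

# Reusing the fitted encoder gives a new value the same code as in training.
france_code = int(country_encoder.transform(['France'])[0])
print(france_code)
```

The same idea would presumably carry over to the one-hot step: call transform (not fit_transform) on the already fitted OneHotEncoder so that the new row gets the same columns as the training data.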