0

I am trying to predict whether income is 70000 or more from two categorical fields (Sex and Highest Cert,dip,deg), using the Python code below.

I binned the average income into ranges and then selected the ranges at or above 70000 as the target I want to predict from Sex and Highest Cert,dip,deg.

However, I get an error when I reach the one-hot encoding part of the code. I am using Python in Visual Studio. I tried changing the categorical field to "Age", but that did not work either. The code is below. How can I fix it? Thank you.

 # %% read dataframe from part1
import pandas as pd
 
df = pd.read_pickle("data.pkl")
 
#%%
import numpy as np
bins = [0, 30000, 50000, 70000, 100000, np.inf]
names = ['<30000', '30000-50000', '50000-70000', '70000-100000', '100000+']
 
df['Avg Emp Income Range'] = pd.cut(df['Avg Emp Income'], bins, labels=names)
 
#%% OHE of Avg empl income
for val in df["Avg Emp Income Range"].unique():
    df[f"Avg Emp Income Range_{val}"] = df["Avg Emp Income Range"] == val
 
#%% selecting data
x= ["Sex","Highest Cert,dip,deg"]
 
#%%
success = ["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]
y = success
 
# %% split into training / testing sets
from sklearn.model_selection import train_test_split
 
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)
 
#%%
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
 
enc = OneHotEncoder(handle_unknown="ignore")
ct = ColumnTransformer(
    [
        ("ohe", enc, ["Sex","Highest Cert,dip,deg",])
    ],
    remainder="passthrough",
)
 
x_train = ct.fit_transform(x_train)
x_test = ct.transform(x_test)

I get this error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
c:\Users\maria\Documents\Project Capstone 2\Z NO\machine L.py in
     42 )
     43
---> 44 x_train = ct.fit_transform(x_train)
     45 x_test = ct.transform(x_test)

c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
    522         else:
    523             self._feature_names_in = None
--> 524         X = _check_X(X)
    525         # set n_features_in attribute
    526         self._check_n_features(X, reset=True)

c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\compose\_column_transformer.py in _check_X(X)
    649     if hasattr(X, '__array__') or sparse.issparse(X):
    650         return X
--> 651     return check_array(X, force_all_finite='allow-nan', dtype=np.object)
    652
    653

c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74

c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    621                     "Reshape your data either using array.reshape(-1, 1) if "
    622                     "your data has a single feature or array.reshape(1, -1) "
--> 623                     "if it contains a single sample.".format(array))
    624
    625         # in the future np.flexible dtypes will be handled like object dtypes

ValueError: Expected 2D array, got 1D array instead: array=['Sex']. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

max
Mariaoye

2 Answers

1

You say that your training data is

x = ["Sex","Highest Cert,dip,deg"]
y = ["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]
# splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)

but the transformer is configured with a list holding one tuple that names dataframe columns:

ct = ColumnTransformer([("ohe", enc, ["Sex","Highest Cert,dip,deg",])])

Now, if you call ct.fit_transform(x_train), the transformer ct expects 2D input (a dataframe that contains the columns named in that tuple), but x_train is only a 1D list of column-name strings, which raises the exception.

I assume that you instead wanted to use x and y as column keys into the dataframe df:

x = ["Sex","Highest Cert,dip,deg"]
y = ["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]
# splitting the data
x_train, x_test, y_train, y_test = train_test_split(df[x], df[y], random_state=123)

It helps to use a debugger, or to execute the code step by step in IPython, so that you can keep track of the shapes of the arrays and check that the code actually does what you expect.
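
The fix above can be sketched end to end as follows. This is only a minimal illustration: the small dataframe stands in for data.pkl and its rows are invented; only the column names are taken from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the pickled dataframe; all values are invented.
df = pd.DataFrame({
    "Sex": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "Highest Cert,dip,deg": ["Bachelor", "Diploma", "Master", "Bachelor",
                             "Diploma", "Master", "Bachelor", "Diploma"],
    "Avg Emp Income Range_70000-100000": [True, False, False, True,
                                          False, False, True, False],
    "Avg Emp Income Range_100000+": [False, False, True, False,
                                     False, True, False, False],
})

x = ["Sex", "Highest Cert,dip,deg"]
y = ["Avg Emp Income Range_70000-100000", "Avg Emp Income Range_100000+"]

# Index the dataframe with the column names so the split receives 2D data,
# and unpack all four results of train_test_split.
x_train, x_test, y_train, y_test = train_test_split(
    df[x], df[y], random_state=123
)

ct = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), x)],
    remainder="passthrough",
)
x_train_enc = ct.fit_transform(x_train)
x_test_enc = ct.transform(x_test)

# Printing shapes at each step is a cheap way to catch 1D-vs-2D mistakes.
print(x_train.shape, x_train_enc.shape, x_test_enc.shape)
```

Checking the shapes this way would have shown that the original x_train was a one-element list rather than a table of rows.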

max
  • Thanks @max. I tried this and it worked, but it then gave me the error: Target is multilabel-indicator but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted', 'samples']. I believe this is because I am using two columns for the y. I might try changing the ranges and see what happens. Any other ideas? – Mariaoye Nov 13 '20 at 03:05
  • Yes and no. You are not calling a metric in the code you posted (you are importing `precision_score` though, which may raise this error). See [here](https://stackoverflow.com/questions/52269187/facing-valueerror-target-is-multiclass-but-average-binary). This is not part of your question anymore. Anyway, just add the parameter `average=None` to the function that raised this error and you'll get the score for both labels. BTW: It is much easier if you post the entire error message. I see that this is difficult in a comment – max Nov 13 '20 at 06:33
  • Thanks so much this worked. Just that I am getting zero precision and recall but I will post that as a new question. – Mariaoye Nov 14 '20 at 18:13
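
As a rough illustration of the averaging options discussed in these comments, the sketch below scores a two-column (multilabel) target. The arrays are invented, and `zero_division=0` is an extra argument added here only to keep labels without predictions from emitting warnings.

```python
import numpy as np
from sklearn.metrics import precision_score

# Invented multilabel targets: one column per income range.
y_true = np.array([[1, 0], [0, 1], [0, 0], [1, 0]])
y_pred = np.array([[1, 0], [0, 0], [0, 1], [1, 0]])

# The default average='binary' is rejected for a multilabel target.
try:
    precision_score(y_true, y_pred)
    binary_error = None
except ValueError as exc:
    binary_error = str(exc)

# average=None returns one precision value per label (column).
per_label = precision_score(y_true, y_pred, average=None, zero_division=0)

# average='micro' pools true and false positives over all labels.
micro = precision_score(y_true, y_pred, average="micro", zero_division=0)

print(binary_error)
print(per_label, micro)
```

With `average=None` a weak label stays visible on its own, whereas `'micro'` folds everything into a single number.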
0

Your x and y data are not set correctly: you are using lists of column headers instead of the dataframe's values. Try setting:

x = df[["Sex","Highest Cert,dip,deg"]]
y = df[["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]]
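
To make the difference concrete, here is a small sketch that feeds three kinds of input to the same transformer: the bare list of headers, a NumPy array of the values, and the dataframe selection. The rows are invented and the helper names `make_ct` and `try_fit` are hypothetical, introduced only for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Invented rows; only the column names come from the question.
df = pd.DataFrame({
    "Sex": ["F", "M", "F", "M"],
    "Highest Cert,dip,deg": ["Bachelor", "Diploma", "Master", "Bachelor"],
})
cols = ["Sex", "Highest Cert,dip,deg"]


def make_ct():
    """Build a fresh, unfitted transformer for each attempt."""
    return ColumnTransformer(
        [("ohe", OneHotEncoder(handle_unknown="ignore"), cols)]
    )


def try_fit(data):
    """Return the encoded shape, or the error message if fitting fails."""
    try:
        return make_ct().fit_transform(data).shape
    except ValueError as exc:
        return f"ValueError: {exc}"


list_result = try_fit(cols)                  # list of header strings (1D)
array_result = try_fit(df[cols].to_numpy())  # NumPy array, no column names
frame_result = try_fit(df[cols])             # dataframe selection

print(list_result)
print(array_result)
print(frame_result)
```

Only the dataframe selection carries both the 2D values and the column names that the string-based column specification relies on.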
rftr
  • Thanks @rftr for your response. I tried this but it gave me the error: "Specifying the columns using strings is only supported for pandas DataFrames". I think this is because I used numpy to create the ranges. I am pretty new to Python and machine learning. I will look for how to create ranges in pandas; maybe that will help? Any other ideas, please? Thanks – Mariaoye Nov 13 '20 at 03:15
  • Okay. This error is raised by the column transformer, which can only handle string column names when it is given a pandas DataFrame. Therefore `x` and `y` should stay DataFrames rather than numpy arrays. I edited my answer. – rftr Nov 13 '20 at 06:09
  • This should normally work, but I got the multilabel target error. Once I changed the precision average to micro, it was fine. Thanks so much – Mariaoye Nov 14 '20 at 18:14
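
If the goal is only "income of 70000 or more", one alternative (an assumption here, not something the posters did) is to collapse the two indicator columns into a single 0/1 target, so that the default binary averaging applies. A minimal sketch with invented values:

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Invented indicator columns, named as in the question.
y = pd.DataFrame({
    "Avg Emp Income Range_70000-100000": [True, False, False, True, False],
    "Avg Emp Income Range_100000+": [False, True, False, False, False],
})

# A row is positive (1) if either of the two high-income ranges applies.
y_binary = y.any(axis=1).astype(int)

# Invented predictions, standing in for a classifier's output.
y_pred = pd.Series([1, 1, 0, 0, 1])

# With a single 0/1 target the default average='binary' is accepted.
precision = precision_score(y_binary, y_pred)
recall = recall_score(y_binary, y_pred)

print(list(y_binary), precision, recall)
```

This trades the per-range detail for a simpler yes/no target, which may or may not suit the analysis.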