Why do I get 1D array instead of 2D array Index error for Machine Learning

Question

I am trying to predict high income (70000+) from two categorical fields (Sex and Highest Cert, dip, deg) using the Python code below.

I binned the average income into ranges and then selected the ranges I want to predict (70000+) from Sex and Highest Cert, dip, deg.

However, I get an error when I reach the one-hot encoding part of the code. I am using Python in Visual Studio. I have tried changing the categorical field to "Age", but that does not work either. How can I fix it? Thank you.

 # %% read dataframe from part1
import pandas as pd
 
df = pd.read_pickle("data.pkl")
 
#%%
import numpy as np
bins = [0, 30000, 50000, 70000, 100000, np.inf]
names = ['<30000', '30000-50000', '50000-70000', '70000-100000', '100000+']
 
df['Avg Emp Income Range'] = pd.cut(df['Avg Emp Income'], bins, labels=names)
 
#%% OHE of Avg empl income
for val in df["Avg Emp Income Range"].unique():
    df[f"Avg Emp Income Range_{val}"] = df["Avg Emp Income Range"] == val
 
#%% selecting data
x= ["Sex","Highest Cert,dip,deg"]
 
#%%
success = ["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]
y = success
 
# %% split into training / testing sets
from sklearn.model_selection import train_test_split
 
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)
 
#%%
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
 
enc = OneHotEncoder(handle_unknown="ignore")
ct = ColumnTransformer(
    [
        ("ohe", enc, ["Sex","Highest Cert,dip,deg",])
    ],
    remainder="passthrough",
)
 
x_train = ct.fit_transform(x_train)
x_test = ct.transform(x_test)

I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
c:\Users\maria\Documents\Project Capstone 2\Z NO\machine L.py in
     42 )
     43
---> 44 x_train = ct.fit_transform(x_train)
     45 x_test = ct.transform(x_test)

c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
    522         else:
    523             self._feature_names_in = None
--> 524         X = _check_X(X)
    525         # set n_features_in attribute
    526         self._check_n_features(X, reset=True)

c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\compose\_column_transformer.py in _check_X(X)
    649     if hasattr(X, '__array__') or sparse.issparse(X):
    650         return X
--> 651     return check_array(X, force_all_finite='allow-nan', dtype=np.object)
    652
    653

c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74

c:\Users\maria\Documents\Project Capstone 2\Z NO\venv\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    621                     "Reshape your data either using array.reshape(-1, 1) if "
    622                     "your data has a single feature or array.reshape(1, -1) "
--> 623                     "if it contains a single sample.".format(array))
    624
    625         # in the future np.flexible dtypes will be handled like object dtypes

ValueError: Expected 2D array, got 1D array instead: array=['Sex']. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

score 1 · Accepted Answer · answered Nov 12 '20 at 06:53

You say that your training data is

x = ["Sex","Highest Cert,dip,deg"]
y = ["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]
# splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)

but what you hand to the encoder is only a list of column names:

ct = ColumnTransformer([("ohe", enc, ["Sex","Highest Cert,dip,deg",])])

Now, if you call ct.fit_transform(x_train), the transformer ct expects 2D input (a DataFrame that contains those columns), but x_train is a 1D list of column-name strings (array=['Sex'] in your traceback), which raises the exception.
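
A minimal sketch of what the split in the question actually returns, assuming only that scikit-learn is installed. The lists are copied from the question; no DataFrame is involved, so the column names themselves are treated as the samples:

```python
from sklearn.model_selection import train_test_split

# Lists of column names, exactly as in the question (not the data itself).
x = ["Sex", "Highest Cert,dip,deg"]
y = ["Avg Emp Income Range_70000-100000", "Avg Emp Income Range_100000+"]

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)

# The two strings are split as if they were two samples,
# so x_train is a flat (1D) list holding a single column name.
print(type(x_train), x_train)
print(len(x_train), len(x_test))
```

Printing x_train like this before calling ct.fit_transform makes it visible that no rows of df ever reach the transformer.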

However, I assume that you wanted to use x and y as column keys into the DataFrame df instead:

x = ["Sex","Highest Cert,dip,deg"]
y = ["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]
# splitting the data
x_train, x_test, y_train, y_test = train_test_split(df[x], df[y], random_state=123)

It helps to use a debugger, or to execute the code step by step in IPython, so that you can keep track of the shapes of the arrays and check whether the code actually does what you think it does.
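
As a rough illustration of checking shapes step by step, here is a sketch on a small made-up DataFrame. The toy values stand in for data.pkl and are purely hypothetical; only the column names come from the question:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for data.pkl (values are invented).
df = pd.DataFrame({
    "Sex": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "Highest Cert,dip,deg": ["BSc", "MSc", "HS", "BSc", "MSc", "HS", "BSc", "MSc"],
    "Avg Emp Income Range_70000-100000": [True, False, False, True, False, False, True, False],
    "Avg Emp Income Range_100000+": [False, True, False, False, True, False, False, True],
})

x = ["Sex", "Highest Cert,dip,deg"]
y = ["Avg Emp Income Range_70000-100000", "Avg Emp Income Range_100000+"]

# Index the DataFrame with the column names so that rows are split.
x_train, x_test, y_train, y_test = train_test_split(df[x], df[y], random_state=123)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

enc = OneHotEncoder(handle_unknown="ignore")
ct = ColumnTransformer([("ohe", enc, x)], remainder="passthrough")

# Fit on the training rows only, then reuse the fitted encoder on the test rows.
x_train_enc = ct.fit_transform(x_train)
x_test_enc = ct.transform(x_test)
print(x_train_enc.shape, x_test_enc.shape)
```

Each print shows a 2D shape (rows, columns), which is what ColumnTransformer needs as input.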

max


Thanks @max. I tried this and it worked, but later I got the error: Target is multilabel-indicator but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted', 'samples']. I believe this is because I am using two columns for y. I might try changing the ranges and see what happens. Any other ideas? – Mariaoye Nov 13 '20 at 03:05
Yes and no. You are not calling a metric in the code you posted (you are importing `precision_score` though, which may raise this error; see [here](https://stackoverflow.com/questions/52269187/facing-valueerror-target-is-multiclass-but-average-binary)). This is not part of your original question anymore. Anyway, just add the parameter `average=None` to the function that raised the error and you'll get the score for each label. BTW: it is much easier to help if you post the entire error message, though I see that this is difficult in a comment. – max Nov 13 '20 at 06:33
Thanks so much, this worked. I am now getting zero precision and recall, but I will post that as a new question. – Mariaoye Nov 14 '20 at 18:13
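
The `average=None` suggestion from the comments above can be sketched as follows. The label arrays are invented for illustration, and `zero_division=0` is an extra argument (not mentioned in the thread) that only silences the warning for labels with no positive predictions:

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical multilabel targets: one column per income range.
y_true = np.array([[1, 0], [0, 1], [0, 0], [1, 0]])
y_pred = np.array([[1, 0], [0, 0], [0, 1], [1, 0]])

# The default average="binary" is rejected for two-column targets.
try:
    precision_score(y_true, y_pred)
except ValueError as err:
    print("ValueError:", err)

# average=None returns one precision value per label column.
per_label = precision_score(y_true, y_pred, average=None, zero_division=0)
print(per_label)

# average="micro" pools all labels into a single score.
micro = precision_score(y_true, y_pred, average="micro", zero_division=0)
print(micro)
```

With two target columns, a per-label score also shows which income range the model fails on, which a single pooled number would hide.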

rftr · Answer 2 · 2020-11-13T06:09:51.700

0

Your x and y data are not set correctly: you are passing lists of column headers instead of the DataFrame's values. Try setting:

x = df[["Sex","Highest Cert,dip,deg"]]
y = df[["Avg Emp Income Range_70000-100000","Avg Emp Income Range_100000+"]]

edited Nov 13 '20 at 06:09

answered Nov 12 '20 at 06:47


Thanks @rftr for your response. I tried this but it gave me the error: "Specifying the columns using strings is only supported for pandas DataFrames". I suspect this is because I used numpy to create the ranges. I am pretty new to Python and machine learning. I will look into how to create ranges in pandas; maybe that will help? Any other ideas? Thanks – Mariaoye Nov 13 '20 at 03:15
Okay. This error is raised by the column transformer, which can only handle string column names when it is given a pandas DataFrame. Therefore you should not convert `x` and `y` to numpy arrays. I edited my answer. – rftr Nov 13 '20 at 06:09
This works, but I then got the "target is multilabel" error. Once I changed the precision average to micro, it was fine. Thanks so much – Mariaoye Nov 14 '20 at 18:14
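
The DataFrame requirement discussed in the comments above can be checked with a small sketch. The three-row frame and the `make_ct` helper are hypothetical and exist only for this illustration; the exact wording of the error message may differ between scikit-learn versions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame; only the column names come from the question.
df = pd.DataFrame({
    "Sex": ["F", "M", "F"],
    "Highest Cert,dip,deg": ["BSc", "MSc", "BSc"],
})
cols = ["Sex", "Highest Cert,dip,deg"]


def make_ct():
    """Build a fresh transformer that selects its columns by name."""
    return ColumnTransformer([("ohe", OneHotEncoder(handle_unknown="ignore"), cols)])


# A DataFrame keeps its column labels, so selecting by name works.
encoded = make_ct().fit_transform(df[cols])
print(encoded.shape)

# A NumPy array has no column labels, so the same transformer fails.
numpy_failed = False
try:
    make_ct().fit_transform(df[cols].to_numpy())
except ValueError as err:
    numpy_failed = True
    print("ValueError:", err)
```

Keeping x as `df[[...]]`, as in the edited answer, avoids this error without changing how the income ranges are created.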
