I've been trying to run some ML code, but it keeps failing at the fitting stage of my pipeline. I've searched various forums without much luck. The one thing I've found is that some people say you can't use LabelEncoder inside a pipeline, but I'm not sure whether that's true. If anyone has any insight on this, I'd be very happy to hear it.
I keep getting this error:
TypeError: fit_transform() takes 2 positional arguments but 3 were given
So I'm not sure whether the problem is on my end or in Python itself. Here's my code:
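import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
# Assumption: TargetEncoder and CatBoostEncoder come from the category_encoders package.
from category_encoders import CatBoostEncoder, TargetEncoder

data = pd.read_csv("ks-projects-201801.csv",
                   index_col="ID",
                   parse_dates=["deadline","launched"],
                   infer_datetime_format=True)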
var = list(data)
data = data.drop(labels=[1014746686,1245461087, 1384087152, 1480763647, 330942060, 462917959, 69489148])
missing = [i for i in var if data[i].isnull().any()]
data = data.dropna(subset=missing,axis=0)
le = LabelEncoder()
oe = OrdinalEncoder()
oh = OneHotEncoder()
# Pull the target column out of the feature list (index 8 is assumed to be "state").
y = data[var.pop(8)]
# Label-encoded target, aligned to the original row index.
p = pd.Series(le.fit_transform(y), index=y.index)
q = pd.read_csv("y.csv",index_col="ID")["0"]
label_y = le.fit_transform(y)
x = data[var]
obj_feat = x.select_dtypes(include="object")
dat_feat = x.select_dtypes(include="datetime64[ns]")
dat_feat = dat_feat.assign(dmonth=dat_feat.deadline.dt.month.astype("int64"),
                           dyear=dat_feat.deadline.dt.year.astype("int64"),
                           lmonth=dat_feat.launched.dt.month.astype("int64"),
                           lyear=dat_feat.launched.dt.year.astype("int64"))
dat_feat = dat_feat.drop(labels=["deadline","launched"],axis=1)
num_feat = x.select_dtypes(include=["int64","float64"])
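# Number of unique values in each object (string) column.
u = dict(zip(list(obj_feat),[len(obj_feat[i].unique()) for i in obj_feat]))
# Bucket the categorical columns by cardinality. Columns with exactly 10 or 20
# unique values, or with 25 to 100, do not land in any bucket.
le_obj = [i for i in u if u[i]<10]
oh_obj = [i for i in u if u[i]<20 and u[i]>10]
te_obj = [i for i in u if u[i]>20 and u[i]<25]
cb_obj = [i for i in u if u[i]>100]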
# Pipeline time
#Impute and encode
strat = ["constant","most_frequent","mean","median"]
sc = StandardScaler()
oh_unk = "ignore"
encoders = [LabelEncoder(),
            OneHotEncoder(handle_unknown=oh_unk),
            TargetEncoder(),
            CatBoostEncoder()]
#num_trans = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[2])),
num_trans = Pipeline(steps=[("sc",sc)])
#obj_imp = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[1]))])
oh_enc = Pipeline(steps=[("oh_enc",encoders[1])])
te_enc = Pipeline(steps=[("te_enc",encoders[2])])
cb_enc = Pipeline(steps=[("cb_enc",encoders[0])])
trans = ColumnTransformer(transformers=[
    ("num",num_trans,list(num_feat)+list(dat_feat)),
    #("obj",obj_imp,list(obj_feat)),
    ("onehot",oh_enc,oh_obj),
    ("target",te_enc,te_obj),
    ("catboost",cb_enc,cb_obj)
])
models = [RandomForestClassifier(random_state=0),
          KNeighborsClassifier(),
          DecisionTreeClassifier(random_state=0)]
model = models[2]
print("Check 4")
# Chaining it all together
run = Pipeline(steps=[("Transformation",trans),("Model",model)])
x = pd.concat([obj_feat,dat_feat,num_feat],axis=1)
print("Check 5")
run.fit(x,p)
Everything runs fine until run.fit(x,p), which raises the error shown above. Any advice or possible ways to resolve this would be greatly appreciated. Thank you!
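For what it's worth, here is a minimal sketch that seems to reproduce the same kind of error in isolation. It assumes the cause is that LabelEncoder.fit_transform accepts only the target y, while Pipeline calls fit_transform(X, y) on each intermediate step. The toy DataFrame, the column name "category", and the helper try_fit are made up for illustration and are not part of my real script. OrdinalEncoder is swapped in only as a comparison, since it is designed for feature columns; it is not necessarily the right encoder for my data.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration (not taken from the Kickstarter CSV).
X = pd.DataFrame({"category": ["art", "music", "art", "film"]})
y = [0, 1, 0, 1]


def try_fit(encoder):
    """Fit a two-step pipeline; return the TypeError message, or None on success."""
    pipe = Pipeline(steps=[("enc", encoder),
                           ("model", DecisionTreeClassifier(random_state=0))])
    try:
        pipe.fit(X, y)
    except TypeError as exc:
        return str(exc)
    return None


# LabelEncoder.fit_transform(y) takes only the target, so the pipeline's
# fit_transform(X, y) call passes one positional argument too many.
label_error = try_fit(LabelEncoder())

# OrdinalEncoder.fit_transform(X, y=None) accepts both arguments.
ordinal_error = try_fit(OrdinalEncoder())

print("LabelEncoder:", label_error)
print("OrdinalEncoder:", ordinal_error)
```

If this sketch reflects what happens in my script, the step to look at would be the one that wraps encoders[0] (the LabelEncoder) in a Pipeline inside the ColumnTransformer.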