I'd like to fit a LogisticRegression on features (X) that are a mix of strings and floats. This is similar to the question "Logistic regression on One-hot encoding".
There are two comments on that question:

"I would like to add that your answer is partially correct. Indeed, if only LabelEncode the strings, and not one_hot encode them. That will create false results since some string will worth "more" than others." – Mornor Jun 27, 2017 at 7:19

"If anyone is wondering what Mornor means, this is because label encode will be numerical values. Ex: France = 0, Italy = 1, etc. That means that some cities are worth more than others. With one-hot encoding each city has the same value: Ex: France = [1, 0], Italy = [0,1]. Also don't forget to the dummy variable trap algosome.com/articles/dummy-variable-trap-regression.html." – Juan Acevedo Jan 20,
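To make the point in these comments concrete, here is a minimal sketch of the difference between label encoding and one-hot encoding. The country list is made up for illustration, and `drop="first"` is shown as one common way to sidestep the dummy variable trap; it is not something the commenters prescribed.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up categorical feature, for illustration only.
countries = np.array(["France", "Italy", "Spain", "Italy"])

# Label encoding: one integer per category. A linear model would read
# these integers as magnitudes (Spain "worth more" than France).
labels = LabelEncoder().fit_transform(countries)

# One-hot encoding: one indicator column per category, no implied order.
one_hot = OneHotEncoder().fit_transform(countries.reshape(-1, 1)).toarray()

# Dropping one indicator column per feature avoids perfect collinearity
# with an intercept term (the "dummy variable trap").
reduced = OneHotEncoder(drop="first").fit_transform(
    countries.reshape(-1, 1)
).toarray()

print(labels)
print(one_hot)
print(reduced)
```

With an L2 penalty and `fit_intercept=False`, as in the code further down, the collinearity is less of a practical problem, so dropping a column is optional there.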
However, these are just comments. I would like to see code that combines the one-hot encoded strings with the float features, as it's not obvious to me how to do that.
Here is the code:
from typing import List, Tuple

import numpy as np
from numpy import float64
from numpy.typing import NDArray
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Result is my own class (defined elsewhere).
def build_model(results: List[Result]) -> Tuple[LogisticRegression, OneHotEncoder]:
    home_names = np.array([r.fixture.home_team.name for r in results])
    away_names = np.array([r.fixture.away_team.name for r in results])
    home_goals = np.array([r.home_goals for r in results])
    away_goals = np.array([r.away_goals for r in results])
    home_spis = np.array([r.home_spi for r in results])
    away_spis = np.array([r.away_spi for r in results])
    home_imps = np.array([r.home_imp for r in results])
    away_imps = np.array([r.away_imp for r in results])
    team_names = np.array(list(home_names) + list(away_names)).reshape(-1, 1)
    team_encoding = OneHotEncoder(sparse=False).fit(team_names)
    encoded_home_names = team_encoding.transform(home_names.reshape(-1, 1))
    encoded_away_names = team_encoding.transform(away_names.reshape(-1, 1))
    team_spis = np.array(list(home_spis) + list(away_spis)).reshape(-1, 1)
    home_spis_reshaped = np.array(list(home_spis)).reshape(-1, 1)
    away_spis_reshaped = np.array(list(away_spis)).reshape(-1, 1)
    x: NDArray[float64] = np.concatenate(
        [encoded_home_names, encoded_away_names, home_spis_reshaped, away_spis_reshaped], 1)  # type: ignore
    y = np.sign(home_goals - away_goals)
    model = LogisticRegression(penalty="l2", fit_intercept=False, multi_class="ovr", C=1)
    model.fit(x, y)
    return model, team_encoding
I then get the following error:

        if n_features != self.n_features_in_:
>           raise ValueError(
                f"X has {n_features} features, but {self.__class__.__name__} "
                f"is expecting {self.n_features_in_} features as input."
            )
E           ValueError: X has 1416 features, but LogisticRegression is expecting 1418 features as input.

../../env/lib/python3.10/site-packages/sklearn/base.py:400: ValueError
So it looks like I have to add the home/away SPI float scores before calling fit on the OneHotEncoder, but I'm unclear on the best way to do this. Thanks.
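One possible reading of the error, offered as an assumption because the calling code is not shown: the model was fitted on 1418 columns (the encoded home names, the encoded away names, and the two SPI columns), while the array passed in later has 1416, which is exactly two short. That would fit a prediction input built from the encoded names only. A minimal sketch of keeping both in step is to build the matrix through one shared helper. `make_features` and the toy data below are hypothetical and not part of the original code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the real results (all values are made up).
home = np.array(["A", "B", "C", "A"])
away = np.array(["B", "C", "A", "C"])
home_spi = np.array([70.0, 55.0, 60.0, 72.0])
away_spi = np.array([50.0, 62.0, 71.0, 58.0])
y = np.array([1, -1, 0, 1])  # sign(home_goals - away_goals)

# One encoder fitted on every team name, as in the question.
encoder = OneHotEncoder(handle_unknown="ignore").fit(
    np.concatenate([home, away]).reshape(-1, 1)
)


def make_features(home, away, home_spi, away_spi):
    """Hypothetical helper: same column layout for fit and predict."""
    return np.hstack([
        encoder.transform(np.asarray(home).reshape(-1, 1)).toarray(),
        encoder.transform(np.asarray(away).reshape(-1, 1)).toarray(),
        np.asarray(home_spi, dtype=float).reshape(-1, 1),
        np.asarray(away_spi, dtype=float).reshape(-1, 1),
    ])


x_train = make_features(home, away, home_spi, away_spi)
model = LogisticRegression(fit_intercept=False, max_iter=1000).fit(x_train, y)

# The prediction row goes through the same helper, so the widths match.
x_new = make_features(["A"], ["B"], [68.0], [52.0])
prediction = model.predict(x_new)
```

Under this reading, the encoder itself does not need to see the floats; they only need to be appended to the encoded names at prediction time as well as at fit time.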
Solution based upon Alexander's help:
from typing import List, Tuple

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Result is my own class (defined elsewhere).
def build_model(results: List[Result]) -> Tuple[LogisticRegression, OneHotEncoder]:
    home_names = np.array([r.fixture.home_team.name for r in results])
    away_names = np.array([r.fixture.away_team.name for r in results])
    home_goals = np.array([r.home_goals for r in results])
    away_goals = np.array([r.away_goals for r in results])
    home_spis = np.array([r.home_spi for r in results])
    away_spis = np.array([r.away_spi for r in results])
    home_imps = np.array([r.home_imp for r in results])
    away_imps = np.array([r.away_imp for r in results])
    team_names = np.array(list(home_names) + list(away_names)).reshape(-1, 1)
    team_features = [home_names, away_names, home_spis, away_spis, home_imps, away_imps]
    df = pd.DataFrame(team_features).transpose()
    df.columns = ['home_team', 'away_team', 'home_spi', 'away_spi', 'home_importance', 'away_importance']
    cat_columns = ["home_team", "away_team"]
    model = LogisticRegression(penalty="l2", fit_intercept=False, multi_class="ovr", C=1)
    team_encoding = OneHotEncoder(sparse=False).fit(team_names)
    # One-hot encode the team columns; pass the float columns through unchanged.
    pipe = make_pipeline(
        ColumnTransformer(
            transformers=[
                ("encode", team_encoding, cat_columns),
            ],
            remainder="passthrough"
        ),
        SimpleImputer(),
        model
    )
    y = np.sign(home_goals - away_goals)
    pipe = pipe.fit(df, y)
    return model, team_encoding
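One caveat with the function above: as far as I understand, ColumnTransformer fits its own clone of the encoder it is given, so the returned `team_encoding` (fitted on `team_names`) is not the encoder the pipeline actually uses. A variation that avoids this is to keep the fitted pipeline and hand it raw rows at prediction time. The sketch below uses made-up toy data, leaves the encoder at its default sparse output, and omits the `sparse=` and `multi_class=` arguments, which I believe have been renamed or deprecated in newer scikit-learn releases. It therefore fits a default multinomial model rather than the "ovr" one above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the DataFrame built from the results (values made up).
df = pd.DataFrame({
    "home_team": ["A", "B", "C", "A"],
    "away_team": ["B", "C", "A", "C"],
    "home_spi": [70.0, 55.0, np.nan, 72.0],  # NaN is filled by SimpleImputer
    "away_spi": [50.0, 62.0, 71.0, 58.0],
})
y = np.array([1, -1, 0, 1])  # sign(home_goals - away_goals)

pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            # handle_unknown="ignore" encodes unseen teams as all zeros.
            ("encode", OneHotEncoder(handle_unknown="ignore"),
             ["home_team", "away_team"]),
        ],
        remainder="passthrough",  # float columns pass through unchanged
    ),
    SimpleImputer(),
    LogisticRegression(fit_intercept=False, max_iter=1000),
)
pipe.fit(df, y)

# Prediction takes a DataFrame with the same columns; the pipeline does
# the encoding, so the feature count always matches what was fitted.
new_fixture = pd.DataFrame({
    "home_team": ["A"],
    "away_team": ["D"],  # team not seen during fitting
    "home_spi": [68.0],
    "away_spi": [52.0],
})
probabilities = pipe.predict_proba(new_fixture)
```

Returning `pipe` instead of `model, team_encoding` would change the function's signature, so whether this fits depends on how the callers use the result.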