I have a pandas DataFrame whose rows contain data of multiple types. I want to fit a model based on this data using statsmodels.formula.api and then make some predictions. For my application I want to make predictions a single row at a time. If I do this naively I get AttributeError: 'numpy.float64' object has no attribute 'log' for the reason described in this answer. Here's some sample code:
import string
import random
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
# Generate an example DataFrame
N = 100
z = np.random.normal(size=N)
u = np.random.normal(size=N)
w = np.exp(1 + u + 2*z)
x = np.exp(z)
y = np.log(w)
names = ["".join(random.sample(string.ascii_lowercase, 4)) for lv in range(N)]  # string.lowercase is Python 2 only
df = pd.DataFrame({"x": x, "y": y, "names": names})
reg_spec = "y ~ np.log(x)"
fit = smf.ols(reg_spec, data=df).fit()
series = df.iloc[0] # In reality it would be `apply` extracting the rows one at a time
print(series.dtype) # gives `object` if `names` is in the DataFrame
print(fit.predict(series)) # AttributeError: 'numpy.float64' object has no attribute 'log'
The problem is that apply feeds me rows as Series, not DataFrames, and because I'm working with multiple types, the Series have dtype object. Sadly np.log doesn't like Series of objects even if all the objects are in fact floats. Swapping apply for transform doesn't help. I could create an intermediate DataFrame with only numeric columns or change my regression specification to y ~ np.log(x.astype('float64')). In the context of a larger program with a more complicated formula these are both pretty ugly. Is there a cleaner approach I'm missing?
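To make the dtype issue concrete, here is a minimal sketch that leaves statsmodels out and only shows how the row's dtype changes depending on how it is extracted. The small frame and its values are made up for illustration. The last step, indexing with a list to get a one-row DataFrame, is an assumption about what might help, not something tested against the full example above.

```python
import numpy as np
import pandas as pd

# Small mixed-type frame standing in for the one in the question (values are made up).
df = pd.DataFrame({
    "x": [1.0, 2.0, 4.0],
    "y": [0.5, 1.5, 2.5],
    "names": ["abcd", "efgh", "ijkl"],
})

# A single row pulled out as a Series has to hold floats and strings together,
# so pandas falls back to the object dtype.
row = df.iloc[0]
print(row.dtype)

# np.log cannot handle the object-dtype row and raises
# (AttributeError or TypeError, depending on the NumPy version).
try:
    np.log(row[["x"]])
    log_failed = False
except (AttributeError, TypeError) as exc:
    log_failed = True
    print(type(exc).__name__)

# Workaround mentioned above: restrict to numeric columns first,
# so each row comes out as a float64 Series.
numeric_row = df.select_dtypes(include="number").iloc[0]
print(numeric_row.dtype, np.log(numeric_row["x"]))

# Possible alternative: a list indexer keeps the row as a one-row DataFrame,
# which preserves each column's own dtype.
one_row = df.iloc[[0]]
print(one_row.dtypes["x"], np.log(one_row["x"]).iloc[0])
```

A one-row DataFrame such as one_row keeps x as float64 while still carrying the names column, so it may be usable as prediction input without changing the formula. I have not checked that against every statsmodels version.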