statsmodels has trouble predicting on formulas using functions like log on rows of heterogeneous type

Question

I have a pandas DataFrame whose rows contain data of multiple types. I want to fit a model based on this data using statsmodels.formula.api and then make some predictions. For my application I want to make predictions a single row at a time. If I do this naively I get AttributeError: 'numpy.float64' object has no attribute 'log' for the reason described in this answer. Here's some sample code:

import string
import random
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

# Generate an example DataFrame
N = 100
z = np.random.normal(size=N)
u = np.random.normal(size=N)
w = np.exp(1 + u + 2*z)
x = np.exp(z)
y = np.log(w)
names = ["".join(random.sample(string.ascii_lowercase, 4)) for lv in range(N)]  # string.lowercase exists only in Python 2
df = pd.DataFrame({"x": x, "y": y, "names": names})

reg_spec = "y ~ np.log(x)"
fit = smf.ols(reg_spec, data=df).fit()
series = df.iloc[0]  # In reality it would be `apply` extracting the rows one at a time
print(series.dtype)  # gives `object` if `names` is in the DataFrame
print(fit.predict(series))  # AttributeError: 'numpy.float64' object has no attribute 'log'

The problem is that apply feeds me rows as Series, not DataFrames, and because I'm working with multiple types, the Series have type object. Sadly np.log doesn't like Series of objects even if all the objects are in fact floats. Swapping apply for transform doesn't help. I could create an intermediate DataFrame with only numeric columns or change my regression specification to y ~ np.log(x.astype('float64')). In the context of a larger program with a more complicated formula these are both pretty ugly. Is there a cleaner approach I'm missing?
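A minimal sketch of the failure mode, using a small made-up two-column frame rather than the data above. It assumes only pandas and NumPy; the exact exception type depends on the NumPy version (an AttributeError as in the question on older releases, a TypeError on newer ones), so both are caught.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column and one string column.
df = pd.DataFrame({"x": [1.0, np.e], "names": ["abcd", "efgh"]})

row = df.iloc[0]        # the columns have different types, so the row is object dtype
row_dtype = row.dtype

try:
    np.log(row[["x"]])  # still object dtype, even though the only value is a float
    log_failed = False
except (AttributeError, TypeError):
    log_failed = True

# Casting to a numeric dtype first lets the ufunc run normally.
numeric = row[["x"]].astype("float64")
log_value = np.log(numeric)["x"]
```

The cast is the same idea as writing `x.astype('float64')` inside the formula, just applied before the data reaches the model.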

Stef · Answer 1 · 2019-08-26T19:27:13.947

1

Although you said you don't want to create an intermediate DataFrame with only numeric columns because it's pretty ugly, I think using select_dtypes to restrict the DataFrame to its numeric columns on the fly is quite elegant and doesn't involve large code modifications. Each row then comes out as a float64 Series instead of an object one:

series = df.select_dtypes(include='number').iloc[0]
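As a rough illustration of what this changes, the sketch below compares the dtype of a row taken from the full frame with one taken after `select_dtypes`. The toy frame is made up for the example, and the sketch only checks dtypes; it does not call `fit.predict`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two numeric columns and one string column.
df = pd.DataFrame({"x": [1.0, 2.0], "y": [0.5, 1.5], "names": ["abcd", "efgh"]})

mixed_row = df.iloc[0]                                    # object dtype
numeric_row = df.select_dtypes(include="number").iloc[0]  # float64, "names" dropped

# np.log now works on the row because its dtype is numeric.
log_x = np.log(numeric_row)["x"]
```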

edited Aug 26 '19 at 19:27

answered Aug 26 '19 at 07:18

Stef


Does this work for you? On my end, I get: `PatsyError: Number of rows mismatch between data argument and np.log(x) (2 versus 100) y ~ np.log(x) ^^^^^^^^^` – Itamar Mushkin Aug 26 '19 at 07:29
@ItamarMushkin: yes, it works for me (pandas 0.25.1). Output `float64 0 -3.020539 dtype: float64`. I only had to change `string.lowercase` to `string.ascii_lowercase` which is not related to the issue in question. – Stef Aug 26 '19 at 07:33
It works for me under 0.23.4. I didn't know about the `df.select_dtypes` method, which is much nicer than supplying my own list. – kuzzooroo Aug 26 '19 at 13:06

score 0 · Answer 2 · answered Aug 28 '19 at 02:25


Another solution that dawned on me as I was doing some other work is to convert the Series that apply gives me into a DataFrame consisting of a single row. This works:

row_df = pd.DataFrame([series])
print(fit.predict(row_df))
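A small sketch of why the single-row DataFrame helps, assuming a made-up toy frame: building the frame from a list of Series lets pandas re-infer a dtype per column, so `x` is numeric again. For comparison it also builds the row with `series.to_frame().T`, which keeps every column as object unless `infer_objects()` is called afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's DataFrame.
df = pd.DataFrame({"x": [1.0, np.e], "y": [0.5, 1.5], "names": ["abcd", "efgh"]})
series = df.iloc[0]                    # object dtype, as in the question

row_df = pd.DataFrame([series])        # one row; column dtypes are re-inferred
transposed = series.to_frame().T       # one row; every column stays object
repaired = transposed.infer_objects()  # asks pandas to re-infer the dtypes

x_dtypes = {
    "DataFrame([series])": row_df["x"].dtype,
    "to_frame().T": transposed["x"].dtype,
    "to_frame().T.infer_objects()": repaired["x"].dtype,
}
print(x_dtypes)

# With a numeric column, np.log behaves as it would inside the formula.
log_x = np.log(row_df["x"]).iloc[0]
```

If the transpose route is used, checking the resulting dtypes before passing the frame to `fit.predict` is probably worthwhile.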

answered Aug 28 '19 at 02:25

kuzzooroo


You can also turn a series into a dataframe with `series.to_frame().T` – Itamar Mushkin Aug 28 '19 at 05:57
