
I would like to use air quality, smoking status, and smoking status squared as the explanatory variables in my linear regression. I can easily get around this by adding the squared values to the .csv file I am reading from, but I would like to do it in Python. Is there a way to square smoking status and use it as part of the multiple linear regression? My csv file has only 3 columns: air quality, smoking status, and asthma.

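import statsmodels.api as sm

# df is the DataFrame read from the csv file
x = df[['Air_quality', 'Smoking_Status']]
y = df['Asthma_Death_Rate']

# add an intercept column to the regressors
x = sm.add_constant(x)

est = sm.OLS(y, x).fit()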
ALollz
nssleep
  • Something like `for col in df: df[col+"_squared"] = df[col]*df[col]`? – pault Oct 23 '18 at 19:32
  • What type of data is `smoking_status`, what does it represent? I'd assume that it is categorical, i.e. someone smokes or they don't. In which case squaring it doesn't really make sense... – smj Oct 23 '18 at 19:38
  • An aside, but `sm.add_constant` is horribly slow for large data. Easier to just add it yourself with `x['const'] = 1` – ALollz Oct 23 '18 at 19:43

2 Answers


To square smoking status in your dataframe:

df['Smoking_Status'] = df['Smoking_Status']**2

Or, with the slower looping version:

df['Smoking_Status'] = df['Smoking_Status'].apply(lambda x: x * x)

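See How to use Apply for more detail. Note that both versions overwrite the values of Smoking_Status in your dataframe; assign the result to a new column instead if the regression also needs the original values.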

Cody Glickman
  • `Series.apply` is essentially a slow for loop. Most simple algebraic calculations can be performed as a vectorized operation, in this case `df['Smoking_Status'] = df['Smoking_Status']**2` – ALollz Oct 23 '18 at 19:42
  • I actually tried this but I did (...x:x^2) and I guess that's no good. This fixed it! – nssleep Oct 23 '18 at 20:00
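
The `x^2` attempt mentioned in the last comment does not square anything, because `^` is bitwise XOR in Python and `**` is the power operator. A small illustration with plain integers (the list `values` is just example input):

```python
# ^ is bitwise XOR in Python; ** is exponentiation.
values = [0, 1, 2, 3]

xor_result = [v ^ 2 for v in values]   # bitwise XOR with 2, not a square
squared = [v ** 2 for v in values]     # actual squares

print(xor_result)  # [2, 3, 0, 1]
print(squared)     # [0, 1, 4, 9]
```

With float values, `^` is not defined at all and raises a `TypeError` instead of silently returning wrong numbers.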

Use the formula API. Patsy notation should make squaring a term trivial, but I could not get it to work. The formula does accept functions, though, so in this case we square the term using numpy.power.

import statsmodels.formula.api as smf
import numpy as np

mod = smf.ols('Asthma_Death_Rate ~ Air_quality + np.power(Smoking_Status, 2)', data=df).fit()

Sample Data:

import pandas as pd
np.random.seed(123)
s = 100

df = pd.DataFrame({'Air_quality': np.random.randint(1, 20, s),
                   'Smoking_Status': np.arange(0, s, 1) + np.random.normal(size=s),
                   'Asthma_Death_Rate': np.arange(0, s, 1)**2})

Output: part of mod.summary()

===============================================================================================
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept                       3.4253     33.039      0.104      0.918     -62.148      68.999
Air_quality                     3.2522      2.721      1.195      0.235      -2.148       8.653
np.power(Smoking_Status, 2)     0.9916      0.005    193.833      0.000       0.981       1.002

As designed, Asthma_Death_Rate is very well correlated with Smoking_Status squared.
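
The patsy notation that did not work is probably a bare `**`, which patsy treats as a formula operator rather than arithmetic; wrapping the expression in patsy's identity function `I(...)` should make it evaluate as ordinary Python. The sketch below assumes the same sample data as above and also keeps the linear `Smoking_Status` term that the question asked for; the names `mod_I` and `mod_np` are only illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same sample data as in the answer.
np.random.seed(123)
s = 100
df = pd.DataFrame({'Air_quality': np.random.randint(1, 20, s),
                   'Smoking_Status': np.arange(0, s, 1) + np.random.normal(size=s),
                   'Asthma_Death_Rate': np.arange(0, s, 1)**2})

# I(...) makes patsy evaluate ** as ordinary Python arithmetic.
mod_I = smf.ols(
    'Asthma_Death_Rate ~ Air_quality + Smoking_Status + I(Smoking_Status**2)',
    data=df).fit()

# The same model written with np.power, for comparison.
mod_np = smf.ols(
    'Asthma_Death_Rate ~ Air_quality + Smoking_Status + np.power(Smoking_Status, 2)',
    data=df).fit()

print(mod_I.params)
```

Both formulas build the same design matrix, so the fitted coefficients should agree.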

ALollz