0

I am having problems running OLS in Python after reading in Stata data. Below are my code and the error message.

import pandas as pd  # To read data
import numpy as np
import statsmodels.api as sm

gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
X = gss[['age', 'impinc' ]]
y = gss[['educ']]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())

The error message says:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

Any thoughts on how to run this simple OLS?

Nick Cox
WaterWood
  • 1
    Can you share a small example of the data? – Arthur Morris Aug 31 '20 at 03:36
  • Certainly. You can download the data file [link](https://drive.google.com/file/d/1f5Ofs0LuwzNroToLM16sRdZsGj1qwCGW/view?usp=sharing) and reproduce the results using my code above. – WaterWood Aug 31 '20 at 14:24
  • 1
    Please post data in body of question to avoid dead or forbidden links for current and future readers. – Parfait Aug 31 '20 at 20:31
  • So no link in the comment, but link in the body of my question? – WaterWood Aug 31 '20 at 23:01
  • Yes, this is what @Parfait is asking. It is important to pick a set of observations that reproduces your problem. The community has produced some guidance for this process [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Arthur Morris Sep 01 '20 at 05:51

2 Answers

4

Your age variable contains the value "89 or older", so the whole column is read as non-numeric, and statsmodels cannot use it as input. You have to deal with this value so that the column can be read as integer or float, for example like this:

gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
gss = gss[gss.age != '89 or older'].copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
gss['age'] = gss.age.astype(float)
X = gss[['age', 'impinc' ]]
y = gss[['educ']]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())

P.S. I'm not saying that dropping observations where age == "89 or older" is the best way. You'll have to decide how best to deal with this. If you want to have a categorical variable in your model you'll have to create dummies first.
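To illustrate the point about dummies, here is a minimal sketch. The `region` column and the toy values are hypothetical and are not taken from gssSample.dta; the idea is only that a string-valued category has to be expanded into 0/1 columns before it goes into the design matrix.

```python
import pandas as pd

# Hypothetical toy data; "region" is an invented categorical column.
df = pd.DataFrame({
    "age": [25, 40, 89, 63],
    "region": ["north", "south", "west", "south"],
})

# One 0/1 column per category; drop_first=True omits the baseline category
# so the dummies are not collinear with the constant term.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True, dtype=float)

# Combine the numeric regressor with the dummy columns.
X = pd.concat([df[["age"]], dummies], axis=1)
print(X.dtypes)
```

The resulting `X` holds only numeric columns, so it can be passed to `sm.add_constant` and `sm.OLS` in the same way as in the code above.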

EDIT: If your .dta file contains a numeric variable with value labels, pd.read_stata uses the value labels as the values by default, so the column is no longer read as numeric. You can pass convert_categoricals=False to pd.read_stata to read in the underlying numeric values instead.
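A small round trip can show the difference. This is only a sketch under assumptions: the toy data and the file name `toy.dta` are invented, and the `value_labels` argument of `DataFrame.to_stata` requires pandas 1.4 or later.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data; 89 stands for the top-coded "89 or older" group.
toy = pd.DataFrame({"age": [25, 40, 89, 63], "educ": [12, 16, 10, 14]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dta")
    # Attach a Stata value label to the code 89 (needs pandas >= 1.4).
    toy.to_stata(path, write_index=False,
                 value_labels={"age": {89: "89 or older"}})

    # Default: value labels replace the underlying codes.
    labelled = pd.read_stata(path)
    # convert_categoricals=False keeps the underlying numeric codes.
    numeric = pd.read_stata(path, convert_categoricals=False)

print(labelled["age"].dtype)
print(numeric["age"].dtype)
```

With the second call, `age` stays numeric and can go straight into `sm.OLS`. Whether treating the top-coded group as exactly 89 is acceptable still depends on your analysis.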

Wouter
  • Thanks a lot for your help here. I am kind of confused. The issue is that "89 or older" is coded as 89 in Stata; "89 or older" is the value label. So after I use read_stata to read the Stata data into Python, does 89 with that label become a string? Is there any way to read in the age variable as numeric (e.g., an option in read_stata)? Thanks a lot! – WaterWood Aug 31 '20 at 23:03
  • If a .dta file contains a variable with value labels, `pandas.read_stata` takes the value labels as the values for the DataFrame by default, see [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_stata.html). You can add the `convert_categoricals=False` option to read in the numeric values, which in this case actually appears to be a better solution. I'll add this to my answer. – Wouter Sep 01 '20 at 07:04
  • Adding `convert_categoricals=False` made it work! Fantastic! Thanks a lot! – WaterWood Sep 03 '20 at 21:38
0

An alternative second line of @Wouter's solution could be:

gss.loc[gss.age=='89 or older','age']='89'

See this discussion of replacing based on a condition for more details.

Of course, whether this replacement is appropriate depends on your use case.
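Note that after the replacement the column still holds strings, so it needs a numeric conversion before it can be used in a regression. A minimal sketch, assuming `age` was read as plain strings (the toy frame below is hypothetical and stands in for the real data):

```python
import pandas as pd

# Hypothetical stand-in for the age column as read from the .dta file.
gss = pd.DataFrame({"age": ["25", "40", "89 or older", "63"]})

# Recode the top-coded label to "89" ...
gss.loc[gss.age == "89 or older", "age"] = "89"
# ... then convert the whole column from strings to floats.
gss["age"] = gss["age"].astype(float)

print(gss["age"].dtype)
```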

Arthur Morris
  • Note that Wouter's answer completely addresses the question you asked in the original post. I'd encourage you to mark it as the accepted answer. – Arthur Morris Sep 01 '20 at 01:53