0

I'm trying to write a program in python that uses a query in SQL to collect data and make a regression model. When I try to actually create the model, however, it gives me this error.

import pyodbc
import pandas
import statsmodels.api as sm
import numpy as np

server = 'ludsampledb.database.windows.net'
database = 'SampleDB'
username = 'sampleadmin'
password = '+U9Ly9/p'   
driver = '{ODBC Driver 17 for SQL Server}'
table = 'GooglePlayStore'

conn = pyodbc.connect('DRIVER='+driver+';SERVER='+server+';PORT=1433;DATABASE='+database+';UID='+username+';PWD='+ password)

sql1 = "SELECT ISNULL((CASE WHEN ISNUMERIC(Rating) = 1 THEN CONVERT(float, Rating) ELSE 0 END), 0) AS 'Rating', ISNULL((CASE WHEN ISNUMERIC(Reviews) = 1 THEN CONVERT(float, Reviews) ELSE 0 END), 0) AS 'Reviews', ISNULL((CASE WHEN ISNUMERIC(SUBSTRING(Size, 0, LEN(Size))) = 1 THEN CONVERT(float, SUBSTRING(Size, 0, LEN(Size))) ELSE 0 END), 0) AS 'Size', ISNULL((CASE WHEN ISNUMERIC(REPLACE(Price, '$', '')) = 1 THEN CONVERT(float, REPLACE(Price, '$', '')) ELSE 0 END), 0) AS 'Price', ISNULL((CASE WHEN ISNUMERIC(REPLACE(SUBSTRING(Installs, 0, LEN(Installs)), ',', '')) = 1 THEN CONVERT(float, REPLACE(SUBSTRING(Installs, 0, LEN(Installs)), ',', '')) ELSE 0 END), 0) AS 'Installs' FROM GooglePlayStore"

data = pandas.read_sql(sql1,conn)

x = np.array([data["Rating"], data["Size"], data["Installs"], data["Price"]]).reshape(-1, 1)
x = sm.add_constant(x)
print(x.shape)
y = np.array([data['Reviews']]).reshape(-1, 1)
print(y.shape)

fit = sm.OLS(y, x).fit() #This is where the error is occurring

I'm pretty sure that I know what is going wrong, but I have no idea how to fix it. I've tried several things already, but none so far have worked.

Kyle Roberts
  • 9
  • 1
  • 6
  • Why the `reshape`? What are outputs of `.shape`? – Parfait Mar 18 '21 at 00:50
  • @Parfait I did `reshape` mostly just to see if it even did anything. This was one of the solutions someone else proposed for this problem on someone else's post and I wasn't sure if it applied to mine. The output for `.shape` is `(21682, 2)` for x and `(10841, 1)` for y. – Kyle Roberts Mar 18 '21 at 06:38
  • Can you post a sample of several rows of what comes out of `read_sql` for [reproducible example](https://stackoverflow.com/q/20109391/1422451)? – Parfait Mar 19 '21 at 02:58

2 Answers2

0

Avoid reshape calls which may be transposing data causing mismatched sizes. Pandas DataFrames and Series (each column of DataFrame) are extensions of numpy 2D and 1D arrays. Also, sm.OLS can directly receive pandas objects.

x = data.reindex(["Rating", "Size", "Installs", "Price"], axis="columns")
x = sm.add_constant(x) 
y = data['Reviews']

fit = sm.OLS(y, x).fit()

Consider even R-style formula (constant automatically added) using lower case ols:

fit = sm.ols(formula="Reviews ~ Rating + Size + Installs + Price", data=data).fit()
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • When I use this formatting for the x and the y, the error is `ValueError: shapes (1,4) and (5,) not aligned: 4 (dim 1) != 5 (dim 0)`. The shapes also didn't look close, so I tried doing a reshape on the y value by using the same `np.array` and when I checked the shape for x and y it gave me: `(10841, 5) (10841, 1)`. I feel like this is closer than before, but I'm not sure what the 5 and the 1 are and how to make them the same value. – Kyle Roberts Mar 18 '21 at 06:44
0

I figured out what was going wrong. It was a little bit about how I had written my x and y values, and I had to get rid of my constant variable:

sql1 = "SELECT ISNULL((CASE WHEN ISNUMERIC(Rating) = 1 THEN CONVERT(float, Rating) ELSE 0 END), 0) AS 'Rating', ISNULL((CASE WHEN ISNUMERIC(Reviews) = 1 THEN CONVERT(float, Reviews) ELSE 0 END), 0) AS 'Reviews', ISNULL((CASE WHEN ISNUMERIC(SUBSTRING(Size, 0, LEN(Size))) = 1 THEN CONVERT(float, SUBSTRING(Size, 0, LEN(Size))) ELSE 0 END), 0) AS 'Size', ROUND(ISNULL((CASE WHEN ISNUMERIC(REPLACE(Price, '$', '')) = 1 THEN CONVERT(float, REPLACE(Price, '$', '')) ELSE 0 END), 0), 0) AS 'Price', (ISNULL((CASE WHEN ISNUMERIC(REPLACE(SUBSTRING(Installs, 0, LEN(Installs)), ',', '')) = 1 THEN CONVERT(float, REPLACE(SUBSTRING(Installs, 0, LEN(Installs)), ',', '')) ELSE 0 END), 0) / 10) AS 'Installs' FROM GooglePlayStore"
data = pandas.read_sql(sql1,conn)

x = data.reindex(["Price", "Installs", "Size", "Rating"], axis="columns")
y = np.array([data['Reviews']]).reshape(-1, 1)

fit = sm.OLS(y, x).fit()
Kyle Roberts
  • 9
  • 1
  • 6
  • What if you used `y = data[Reviews]` which would be a pandas Series (or 1D numpy array)? Did formula style above not work (be sure to use latest `statsmodels.api`)? – Parfait Mar 19 '21 at 18:01
  • @Parfait it didn't work for me, I'm not sure if it might be because of the version of `statsmodels.api` I'm using or something else, but the dimensions weren't adding up with the x, but it only worked when I did the `.reshape(-1,1)` – Kyle Roberts Mar 20 '21 at 07:14