I am trying to run logistic regression on my y and X, but I keep getting the error 'setting an array element with a sequence'. I think I might have to reshape my data, but I am not sure what array dimensions to use. I tried reshaping k to (3, 1) and g to (4000000, 1), but it still did not work. My code is below (without the reshaping). The data comes from a netCDF file. I would appreciate any help, thank you.

import pandas as pd 
import geopandas as gpd
from netCDF4 import Dataset
from osgeo import gdal, ogr
f = Dataset('C:\\filename.nc', 'r')
#Extract pixel 'coords'

B01_DATA = f.variables['B01_DATA'][:]
B02_DATA = f.variables['B02_DATA'][:]
VIS_DATA = f.variables['VIS_DATA'][:]


#these are look-up tables
B01_LUT = f.variables['B01_LUT'][:]
B02_LUT = f.variables['B02_LUT'][:]
VIS_LUT = f.variables['VIS_LUT'][:]

min_lat = -15
min_lon = 90
res = 0.009 #resolution 

import numpy as np
lst = []
for x in range(0, 2000): 
    for y in range(0,2000):  
        B01 = (B01_LUT[B01_DATA[x,y]]) 
        B02 = (B02_LUT[B02_DATA[x,y]])
        VIS = (VIS_LUT[VIS_DATA[x,y]])

        k = np.array([B01,B02,VIS], dtype=np.float32)
        lst.append(k)
df = pd.DataFrame()
df['X'] = lst  # column name must match final_df.X used below
#print(df)     
lst1 = []
lst2=[]
for x in range(0, 2000): 
    for y in range(0,2000):  
        lon = min_lat + x*res 
        lat = min_lon + y*res
        lst1.append(lat)
        lst2.append(lon)
df1 = pd.DataFrame()
df1['Latitude'] = lst1
df1['Longitude'] = lst2
df1['Coords'] = list(zip(df1.Latitude, df1.Longitude))

print(df1)

import shapefile
from shapely.geometry import shape, Point

# read your shapefile
r = shapefile.Reader("C:\\shapefile.shp")

# get the shapes
shapes = r.shapes()

# build a shapely polygon from your shape
hold = []
for k in range(20,22): #I am only taking a subset of layers in the polygon
    polygon = shape(shapes[k])
    for x in df1.Coords: 
        if polygon.contains(Point(x)):
            hold.append(x) 

#print(len(hold))

g = np.where(df1['Coords'].isin(hold), 1,0)

g.tolist()

df1['y'] = g 

final_df = df.join(df1)
print(final_df)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X = final_df.X
y = final_df.y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train) 

This is the full error message:

ValueError                                Traceback (most recent call last)
<ipython-input-12-f189af4819e6> in <module>()
      2 from sklearn.linear_model import LogisticRegression
      3 logmodel = LogisticRegression()
----> 4 logmodel.fit(X_train, y_train)

~\Anaconda2\envs\python3env\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y, sample_weight)
   1214 
   1215         X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype,
-> 1216                          order="C")
   1217         check_classification_targets(y)
   1218         self.classes_ = np.unique(y)

~\Anaconda2\envs\python3env\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    571     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    572                     ensure_2d, allow_nd, ensure_min_samples,
--> 573                     ensure_min_features, warn_on_dtype, estimator)
    574     if multi_output:
    575         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~\Anaconda2\envs\python3env\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    431                                       force_all_finite)
    432     else:
--> 433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434 
    435         if ensure_2d:

ValueError: setting an array element with a sequence.
desertnaut
Kalamazoo

1 Answer

It looks like the error occurs because your X column holds an array in every cell, which is not a valid input format for a model: scikit-learn expects a 2-D numeric array with one column per feature. Try something like this (taken from "Pandas split column of lists into multiple columns"):

X = pd.DataFrame(final_df.X.values.tolist(), columns=['x1','x2','x3'])

This should return a three-column dataframe with one band value (B01, B02, VIS) per column, which you can then split and pass to fit.
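To make the cause and the fix concrete, here is a minimal sketch on synthetic data. The random band values, the 200-row size, and the rule used to build y are assumptions for illustration only and stand in for your real final_df. It first reproduces the error by converting the array-valued column to floats, then expands the column and fits the model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_pixels = 200  # assumption: small stand-in for the 2000 x 2000 grid

# Hypothetical stand-in for final_df: each cell of 'X' holds a length-3 array
bands = rng.random((n_pixels, 3)).astype(np.float32)
final_df = pd.DataFrame({"X": list(bands)})
final_df["y"] = (bands[:, 0] > 0.5).astype(int)  # made-up labels

# Reproduce the problem: an object column of arrays cannot become a float array
try:
    np.asarray(final_df["X"], dtype=np.float64)
    raised = False
except ValueError:
    raised = True

# Fix: expand the column of arrays into one numeric column per value,
# keeping the original index so the rows stay aligned with y
X = pd.DataFrame(final_df["X"].tolist(),
                 index=final_df.index,
                 columns=["x1", "x2", "x3"])
y = final_df["y"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)
```

Passing index=final_df.index matters if final_df has been filtered or reordered, since otherwise X would get a fresh 0..n-1 index that no longer lines up with y.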

Sven Harris