
I am new to data science and was reading a paper about predicting housing prices from latitude and longitude (OECD Statistics Working Papers 2011/01, "Hedonic Price Indexes for Housing" by Robert Hill, if anyone is interested). The author says that a popular method in econometrics is to create 'a matrix of distances between all properties in the data set' and then 'use methods developed in the spatial econometrics literature to allow for spatial dependence in the estimated model'.

I searched Stack Overflow and came across a similar question, "Creating a Distance Matrix?". I used the simplest method suggested there because I'm relatively new to Python.

To test the code, I created a new data frame containing only the 'lat' and 'long' columns and kept just the first 10 rows.

df = df[['lat', 'long']]
df_new = df.head(10)

My plan was to create the matrix and then join it back to the original data frame.

pd.DataFrame(distance_matrix(df_new.values, df_new.values))

Everything seemed fine with 10 rows, but when I ran the same code on all 21,000 data points, the kernel crashed (which makes sense, given the size of the resulting matrix).
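For a sense of scale, here is a rough back-of-the-envelope estimate of the memory the full matrix would need. It assumes the distances are stored as 8-byte floats (float64) and counts only the result itself, not any temporary copies made along the way; the helper name `dense_matrix_bytes` is just for illustration.

```python
def dense_matrix_bytes(n_points: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to hold a dense n x n matrix (assumes float64 entries)."""
    return n_points * n_points * bytes_per_value


n = 21_000
size_gb = dense_matrix_bytes(n) / 1e9  # decimal gigabytes
print(f"{n} x {n} float64 matrix: about {size_gb:.2f} GB")
```

Wrapping the result in a `pd.DataFrame` may need additional memory on top of this, so the real peak is likely higher than the figure printed here.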

So I was wondering: 1) is there a way around this, and 2) is a full distance matrix an appropriate method for large datasets?
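One direction I have seen mentioned, though I am not sure it is what the paper intends, is to keep only each property's few nearest neighbours instead of all pairwise distances. Below is a minimal sketch using `scipy.spatial.KDTree`. The coordinates are synthetic stand-ins for `df[['lat', 'long']].values`, the choice of `k = 5` is arbitrary, and the distances are plain Euclidean distances in degrees (as in my snippet above), not true geographic distances.

```python
import numpy as np
from scipy.spatial import KDTree

# Synthetic coordinates standing in for df[['lat', 'long']].values (assumption)
rng = np.random.default_rng(0)
coords = np.column_stack([
    rng.uniform(47.0, 48.0, size=1000),      # made-up latitudes
    rng.uniform(-123.0, -121.0, size=1000),  # made-up longitudes
])

tree = KDTree(coords)

k = 5  # number of neighbours to keep per point (arbitrary choice)
# Ask for k + 1 because each point's closest match is itself (distance 0)
dist, idx = tree.query(coords, k=k + 1)

neighbour_dist = dist[:, 1:]  # shape (n_points, k), drops the self-match
neighbour_idx = idx[:, 1:]    # row positions of those neighbours in coords

print(neighbour_dist.shape)
```

This stores n x k values rather than n x n, so for 21,000 points and k = 5 it is about 105,000 distances instead of 441 million. Whether a nearest-neighbour structure is an acceptable substitute for the full matrix in a spatial econometrics model is exactly the part I am unsure about.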
