I have a house-pricing data set of 80,000+ rows. Using latitude and longitude, I calculated each house's distance to a landmark in the city, for price prediction. I found that 5,500+ houses are close to it (less than 2 km away). Now I want to know whether those close houses are also the most expensive ones in my original data set, or at least what percentage of them are.
data['metro'] = data['locations'].apply(lambda x: any([a in str(x).lower() for a in ['metro', 'm.']]))
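import pandas as pd
from math import sin, cos, sqrt, atan2, radians

R = 6373.0  # approximate Earth radius in km
# Landmark coordinates, converted to radians once outside the loop
lat2 = radians(40.3672364)
lon2 = radians(49.8315896)

distance_to_iceriseher = []
for a in range(len(data)):
    lat1 = radians(data['latitude'][a])
    lon1 = radians(data['longitude'][a])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # Haversine formula for the great-circle distance
    b = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(b), sqrt(1 - b))
    distance = R * c
    distance_to_iceriseher.append(round(distance, 5))

# Store the distances as a column so the filter below can use them
distance_to_iceriseher_ser = pd.Series(distance_to_iceriseher, index=data.index)
data['distance_to_iceriseher_ser'] = distance_to_iceriseher_ser
print(len(distance_to_iceriseher))
print(data.loc[data['distance_to_iceriseher_ser'] < 2].count().tolist())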
Result of the code below:
23406 2900000
67112 2800000
9840 2500000
46149 2500000
9800 2444000
...
68144 112000
68585 110000
69029 110000
25459 85000
36668 1000
datax = pd.read_csv('binaaz_train.csv')
datax['distance_to_iceriseher_ser'] = pd.Series(distance_to_iceriseher)
top_5000 = datax[datax.distance_to_iceriseher_ser < 1].nlargest(5000, 'price')['price']
top_5000
I tried to extract the 5000 most expensive houses from the original data set and compare them with distance_to_iceriseher, but converting it to a DataFrame didn't work out, and I couldn't find any other way. Any help is appreciated.
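To make the goal concrete, here is a minimal sketch of the comparison I am after, on made-up toy data rather than my real data set. The names `toy`, `radius_km` and `top_n` and all the numbers are assumptions for illustration only; the column names `price` and `distance_to_iceriseher_ser` are the ones used in my code. The idea is to build two boolean masks over the whole table (close to the landmark, and among the most expensive) and count where both are true.

```python
import pandas as pd

# Toy stand-in for the real table (hypothetical values, not real results)
toy = pd.DataFrame({
    "price": [900, 800, 700, 600, 500, 400, 300, 200],
    "distance_to_iceriseher_ser": [0.5, 3.0, 1.2, 5.0, 1.9, 8.0, 0.3, 4.0],
})

radius_km = 2  # "close" threshold in km
top_n = 3      # would be 5000 on the full data set

# Mask 1: rows within the radius of the landmark
is_close = toy["distance_to_iceriseher_ser"] < radius_km

# Mask 2: rows among the top_n most expensive in the WHOLE table
top_idx = toy.nlargest(top_n, "price").index
is_top = toy.index.isin(top_idx)

# Rows that are both close and among the most expensive
n_both = int((is_close & is_top).sum())

# Share of the close houses that are in the top_n by price
share_of_close_in_top = n_both / int(is_close.sum())

# Share of the top_n most expensive houses that are close
share_of_top_that_is_close = n_both / top_n

print(n_both, share_of_close_in_top, share_of_top_that_is_close)
```

The difference from my attempt is that `nlargest` here runs on the full table and not on the already filtered rows, so the two groups can actually be compared.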