Need to compare data of 5000 houses to the top 5000 rows of original dataset

Question

I have a house-pricing dataset of 80000+ rows. Using latitude and longitude, I calculated each house's distance to a landmark in the city, to help with price prediction. I found that 5500+ houses are close to it (less than 2 km), and now I want to know whether those close houses are also the most expensive ones in my original dataset, or at least what percentage of them are.

data['metro'] = data['locations'].apply(lambda x: any([a in str(x).lower() for a in ['metro', 'm.']]))

import pandas as pd
from math import sin, cos, sqrt, atan2, radians

R = 6373.0  # approximate Earth radius in km

distance_to_iceriseher = []

# Haversine distance from every house to the landmark
for a in range(0, len(data)):
    lat1 = radians(data['latitude'][a])
    lon1 = radians(data['longitude'][a])
    lat2 = radians(40.3672364)
    lon2 = radians(49.8315896)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    b = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(b), sqrt(1 - b))

    distance = R * c
    distance_to_iceriseher.append(round(distance, 5))

# Build the Series once, after the loop, and attach it to the frame
# so that the filter below can find the column
distance_to_iceriseher_ser = pd.Series(distance_to_iceriseher, index=data.index)
data['distance_to_iceriseher_ser'] = distance_to_iceriseher_ser

print(len(distance_to_iceriseher))
print(data.loc[data['distance_to_iceriseher_ser'] < 2].count().tolist())
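For reference, the same haversine formula can be applied to whole columns at once instead of row by row. This is only a minimal sketch, assuming NumPy is available; the helper name `haversine_km` and the two-row `demo` frame are made up for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

R = 6373.0  # approximate Earth radius in km, same value as in the loop above


def haversine_km(lat, lon, lat0=40.3672364, lon0=49.8315896):
    """Haversine distance in km from each (lat, lon) pair to the landmark (lat0, lon0)."""
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(lat0), np.radians(lon0)
    b = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return R * 2 * np.arctan2(np.sqrt(b), np.sqrt(1 - b))


# Hypothetical two-row frame, only to show the call; the real `data` has 80000+ rows.
demo = pd.DataFrame({"latitude": [40.3672364, 40.40],
                     "longitude": [49.8315896, 49.85]})
demo["distance_to_iceriseher_ser"] = haversine_km(
    demo["latitude"], demo["longitude"]
).round(5)
```

With the real frame, the call would be `haversine_km(data['latitude'], data['longitude'])`, which returns a Series aligned to `data`'s index.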

Result of the code below:

23406    2900000
67112    2800000
9840     2500000
46149    2500000
9800     2444000
          ...   
68144     112000
68585     110000
69029     110000
25459      85000
36668       1000

datax = pd.read_csv('binaaz_train.csv')
datax['distance_to_iceriseher_ser'] = pd.Series(distance_to_iceriseher)


top_5000 = datax[datax.distance_to_iceriseher_ser < 1].nlargest(5000, 'price')['price']
top_5000

I tried to extract the 5000 most expensive houses from the original dataset and compare them to `distance_to_iceriseher`, but converting it to a DataFrame didn't work, and I couldn't find any other way. Any help is appreciated.

So you are trying to compare `data` to `datax`? – Aug 20 '22 at 16:38

score 0 · Answer 1 · answered Aug 20 '22 at 11:48

To assign a list (`distance_to_iceriseher`) to a DataFrame column, you can assign it directly. See the related question "Add column in dataframe from list".

datax = pd.read_csv('binaaz_train.csv')

datax['distance_to_iceriseher_ser'] = distance_to_iceriseher
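Once the distance column is in place, the overlap the question asks about can be counted by intersecting the index of the top-priced rows with the index of the close rows. The following is a minimal sketch rather than a definitive solution: the helper `overlap_with_top` and the six-row `toy` frame are hypothetical, and it assumes the column names `price` and `distance_to_iceriseher_ser` from the question and at least one house within the distance threshold.

```python
import pandas as pd


def overlap_with_top(df, n, max_km, price_col="price",
                     dist_col="distance_to_iceriseher_ser"):
    """Count houses that are both within max_km and among the n most expensive.

    Returns (count, share of close houses, share of top-n houses).
    Assumes at least one house is closer than max_km.
    """
    top_idx = df[price_col].nlargest(n).index      # labels of the n highest prices
    close_idx = df.index[df[dist_col] < max_km]    # labels of the close houses
    both = top_idx.intersection(close_idx)
    return len(both), len(both) / len(close_idx), len(both) / len(top_idx)


# Made-up toy data, for illustration only
toy = pd.DataFrame({
    "price": [500, 400, 300, 200, 100, 50],
    "distance_to_iceriseher_ser": [0.5, 3.0, 1.5, 0.8, 4.0, 1.0],
})
count, share_close, share_top = overlap_with_top(toy, n=3, max_km=2)
print(count, share_close, share_top)
```

On the toy frame, 2 of the 4 close houses are in the top 3 by price. For the real data the call would be `overlap_with_top(datax, n=5000, max_km=2)`.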

answered Aug 20 '22 at 11:48

Robin

