
I have an interesting problem. I have two files: NYPD_Motor_Vehicle_Collisions.csv has 1.2M lines and weatherfinal.txt has 109K lines. The objective is to merge the temperature and precipitation data from weatherfinal.txt into the collisions file as two new columns, matching on latitude and longitude. I wrote the following code using pandas DataFrames in Python.

from math import cos, asin, sqrt
import pandas as pd
import numpy as np
import os
import re
import datetime

def distance(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in km: p converts degrees to
    # radians, and 12742 is the Earth's diameter in km.
    p = 0.017453292519943295
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    return 12742 * asin(sqrt(a))

def closest(data, v):
    # Return the record in data whose lat/lon is nearest to point v.
    return min(data, key=lambda p: distance(v['lat'], v['lon'], p['lat'], p['lon']))

tempDataList = []

#v = {'lat': 39.7622290, 'lon': -86.1519750}
#print(closest(tempDataList, v))

print(os.getcwd())

# Collapse runs of spaces so the weather file parses as space-separated.
filed_ = open("weatherfinal.txt", 'r')
fileo_ = open("weatherfinal_updated.txt", "w")
for line_ in filed_:
    outline = re.sub("  +", " ", line_.rstrip("\n"))
    fileo_.write(outline + "\n")
filed_.close()
fileo_.close()

df = pd.read_csv("NYPD_Motor_Vehicle_Collisions.csv")

colhead = np.append(df.columns.values,['TEMP', 'PREP'])

outdf = pd.DataFrame(columns=colhead)

df2 = pd.read_csv("weatherfinal_updated.txt", sep=' ')
# Note: the original df2.set_index(['WBANNO', 'LST_DATE', 'LST_TIME']) call
# had no effect, because set_index returns a new DataFrame unless assigned
# back; the lookups below filter on these fields as ordinary columns.

sensorIds = df2['WBANNO'].unique()

# Build one {lat, lon, SENSORID} record per unique weather sensor.
for ids_ in sensorIds:
    longitude = df2.loc[df2['WBANNO'] == ids_, 'LONGITUDE'].iloc[0]
    latitude = df2.loc[df2['WBANNO'] == ids_, 'LATITUDE'].iloc[0]
    tempDataList.append({'lat': latitude, 'lon': longitude, 'SENSORID': ids_})

print(tempDataList)

# Row-by-row scan of the 1.2M collisions; every iteration re-filters the
# full weather frame, which is where the time goes.
for index, row in df.iterrows():
    lon_ = row['LONGITUDE']
    lat_ = row['LATITUDE']
    tdate = row['DATE']
    ttime = row['TIME']
    tcal = 5   # default temperature when no weather row is found
    pcal = 0   # default precipitation when no weather row is found
    # Convert the collision date/time (03/11/2018, 10:12) to the weather
    # file's numeric format (20180311, 1012); ntime caps a one-hour window.
    fwdate = datetime.datetime.strptime(str(tdate), '%m/%d/%Y').strftime('%Y%m%d')
    fwtime = datetime.datetime.strptime(str(ttime), '%H:%M').strftime('%H%M')
    ntime = float(fwtime) + float(100)
    closests_ = closest(tempDataList, {'lat': lat_, 'lon': lon_})
    sensorid = closests_['SENSORID']
    usedSensorId = sensorid
    # Reading from the nearest sensor within the collision's hour.
    selectedWeatherRow = df2.loc[(df2.WBANNO == sensorid) & (df2.LST_DATE == float(fwdate)) & (df2.LST_TIME >= float(fwtime)) & (df2.LST_TIME < ntime), ['T_CALC', 'P_CALC']]
    if len(selectedWeatherRow.index) == 0:
        # Fall back to any other sensor with a reading for that hour.
        for sensId in sensorIds:
            if sensId == sensorid:
                continue
            selectedWeatherRow = df2.loc[(df2.WBANNO == sensId) & (df2.LST_DATE == float(fwdate)) & (df2.LST_TIME >= float(fwtime)) & (df2.LST_TIME < ntime), ['T_CALC', 'P_CALC']]
            if len(selectedWeatherRow.index) == 0:
                continue
            else:
                tcal = selectedWeatherRow['T_CALC'].values[0]
                pcal = selectedWeatherRow['P_CALC'].values[0]
                usedSensorId = sensId
                break
    else:
        tcal = selectedWeatherRow['T_CALC'].values[0]
        pcal = selectedWeatherRow['P_CALC'].values[0]
    row['TEMP'] = tcal
    row['PREP'] = pcal
    outdf.loc[index] = row
    print(index, tcal, pcal, fwdate, fwtime, ntime, usedSensorId)


print "Loop completed"
outdf.to_csv("NYPD_TRAFFIC_DATA.csv")
print "file completed"

This program has been running for days, and I am not sure why the DataFrame version is so slow. I rewrote the program without DataFrames, using plain dictionaries, and it completed in a few minutes. I am not sure whether pandas is slow here or whether I am using it incorrectly. Just posting here for learning.
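For reference, a minimal sketch of the kind of constant-time lookup the dictionary rewrite enables; `lookupWeather` is a hypothetical helper for illustration (the actual build function appears in the comments below), assuming the nested sensor/date/time layout described there:

def lookupWeather(weather_dict, sensorid, fwdate, fwtime):
    # weather_dict maps sensor id -> date -> time -> weather record, so a
    # lookup is a few hash probes instead of four boolean scans over the
    # 109K-row weather frame.
    try:
        return weather_dict[sensorid][fwdate][fwtime]
    except KeyError:
        return None  # caller falls back to the defaults tcal = 5, pcal = 0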

  • Sounds long. Can you add some print statements to see where it gets hung up? Or maybe you know already? – Joshua R. Mar 28 '18 at 23:09
  • If I just see that you're using `.iloc[0]` all the time and iterate over the whole frame, then I come to the conclusion that you're using pandas wrong. Though it would be easier if you just give us some data and only a few lines of code to review – pythonic833 Mar 28 '18 at 23:12
  • Anything with `df.iterrows` is likely to be unbearably slow. Try `df.itertuples` as a first attempt. If it's still too slow, and you can't find a way to vectorise via `pandas`, it may be worthwhile dropping down to `numpy`. Try also to profile your code so you know which methods / functions are expensive. – jpp Mar 28 '18 at 23:22
  • it is not hung up. It is still running. The program has been validated using a small data set. – MANISH SHAH Mar 28 '18 at 23:33
  • @pythonic833 It is the for loop that the program is executing: `for index, row in df.iterrows():` – MANISH SHAH Mar 28 '18 at 23:35
  • Well, as @jpp said, you should vectorize this and please, let's concentrate on a few lines of code, so that we can see the input and the desired output. Then we will be able to show you how to vectorize this. – pythonic833 Mar 28 '18 at 23:38
  • The loop extracts the date, time, latitude and longitude from the collision file and matches with the latitude, longitude, date, time from the nearest sensor in the weather file. It extracts the temp, and precipitation from the weather file and appends as columns to the collision file in a new dataframe. It writes the dataframe once 1.2M records are processed. – MANISH SHAH Mar 28 '18 at 23:44
  • From the code, I think the following line is making it slow: `selectedWeatherRow = df2.loc[(df2.WBANNO == sensorid) & (df2.LST_DATE == float(fwdate)) & (df2.LST_TIME >= float(fwtime)) & (df2.LST_TIME < ntime), ['T_CALC', 'P_CALC']]` – MANISH SHAH Mar 28 '18 at 23:48
  • Why are you converting to `datetime` and performing time additions in a loop? Look [here](https://stackoverflow.com/questions/26763344/convert-pandas-column-to-datetime) and [here](https://stackoverflow.com/questions/28954093/how-to-add-subtract-time-hours-minutes-etc-from-a-pandas-dataframe-index-wh) for relevant vectorised calculations. In fact, why do you even set *constants* in a loop?.. just do `df['mycol'] = 5`. – jpp Mar 28 '18 at 23:48
  • The format of date-time in the collision file is different from the weather file. The collision file has the format 03/11/2018, 10:12 while the weather file has the format 20120101 0400. The weather file has the data by the hour. – MANISH SHAH Mar 28 '18 at 23:51
  • @jpp 5 and 0 are defaults. These get overwritten if the value is found in the weather file. If not, these are used. – MANISH SHAH Mar 28 '18 at 23:54
  • Instead of DataFrames, I created a dictionary for the lookup in the weather file and this lookup is faster:

    def buildWeatherDict(wlines):
        weather_dict = {}
        for wline_ in wlines[1:]:
            line_ = wline_.split(' ')
            sensorid_ = line_[0]
            lst_date = line_[3]
            lst_time = line_[4]
            if not sensorid_ in weather_dict:
                weather_dict[sensorid_] = {}
            if not lst_date in weather_dict[sensorid_]:
                weather_dict[sensorid_][lst_date] = {}
            weather_dict[sensorid_][lst_date][lst_time] = line_
        return weather_dict

    – MANISH SHAH Mar 29 '18 at 00:00
  • The code and data is on github. https://github.com/amshah1107/NYPD_Motor_Vehicle_Collision/tree/master/Preprocessing – MANISH SHAH Mar 29 '18 at 00:30
  • Correct me if I'm wrong, but all of those comparisons are killing you. You're doing at least 8 comparisons every loop, and if sensor IDs is large you could be doing far more. If that is the 1.2M row data frame then you're killing yourself finding where the `==` happen over and over and over. You shouldn't be checking equality on 1.2M rows close to 100x8 K times. – ALollz Mar 29 '18 at 01:54
  • It sounds like if you want to solve this, you should coarse grain your weather data into latitude and longitude blocks (like squares on a checker board) and then you should convert your lat-long data in the collisions to the same grid. Then it's a simple `df.merge(df2, on=['lat_bin', 'long_bin'])` and it will take all of 60 seconds if that. – ALollz Mar 29 '18 at 01:57
  • @ALollz I have the same comparisons in the dictionary version and it is faster. I also have the index set in the dataframe. – MANISH SHAH Mar 29 '18 at 20:21
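
As a footnote to jpp's `itertuples` suggestion above, a minimal sketch of the swap, assuming the column names used in the question (an illustration, not the asker's code):

import pandas as pd

df = pd.read_csv("NYPD_Motor_Vehicle_Collisions.csv")

# itertuples yields lightweight namedtuples instead of building a Series
# per row, which is usually several times faster than iterrows.
for r in df.itertuples():
    lat_, lon_ = r.LATITUDE, r.LONGITUDE
    tdate, ttime = r.DATE, r.TIME
    # ... per-row weather lookup as in the question ...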
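
In the same spirit, a sketch of the vectorised date/time conversion jpp links to, using the formats given in the comments (03/11/2018 and 10:12 in the collisions file):

import pandas as pd

df = pd.read_csv("NYPD_Motor_Vehicle_Collisions.csv")

# One pass over the whole column replaces a strptime/strftime per row.
ts = pd.to_datetime(df['DATE'] + ' ' + df['TIME'], format='%m/%d/%Y %H:%M')
df['fwdate'] = ts.dt.strftime('%Y%m%d').astype(float)
df['fwtime'] = (ts.dt.hour * 100 + ts.dt.minute).astype(float)
df['ntime'] = df['fwtime'] + 100  # end of the one-hour lookup window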
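
And a rough sketch of the coarse-graining ALollz describes; the 0.1 degree bin size is an arbitrary assumption here, and a complete solution would also join on date and hour:

import pandas as pd

df = pd.read_csv("NYPD_Motor_Vehicle_Collisions.csv")
df2 = pd.read_csv("weatherfinal_updated.txt", sep=' ')

# Snap both frames onto the same lat/long grid, then merge once instead
# of running a nearest-sensor search for every collision row.
for frame in (df, df2):
    frame['lat_bin'] = frame['LATITUDE'].round(1)
    frame['lon_bin'] = frame['LONGITUDE'].round(1)

merged = df.merge(df2[['lat_bin', 'lon_bin', 'T_CALC', 'P_CALC']],
                  on=['lat_bin', 'lon_bin'], how='left')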
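
On the last comment: the `df2.set_index(...)` call in the question was never assigned back, so no index was actually in effect during the loop. A sketch of what an indexed lookup could look like (key values are illustrative):

import pandas as pd

df2 = pd.read_csv("weatherfinal_updated.txt", sep=' ')

# Assign the result back and sort so .loc can binary-search the index
# rather than scanning every row with boolean masks.
df2 = df2.set_index(['WBANNO', 'LST_DATE', 'LST_TIME']).sort_index()

# One sensor, one day, the 10:00-10:59 window (hypothetical key values).
reading = df2.loc[(1234, 20180311.0, slice(1000.0, 1059.0)),
                  ['T_CALC', 'P_CALC']]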

0 Answers