I have a large CSV file (>20 GB) with data like this:
ObjectID,Lon,Lat,Speed,GPSTime
182163,116.367520,40.024680,29.00,2016-07-04 09:01:09.000
116416,116.694693,39.785382,0.00,2016-07-04 09:01:17.000
Using PySpark (RDD, map and reduce), I want to process this geographic data: for each row, check whether the (Longitude, Latitude) point lies inside a polygon and, if so, write the row to an output file.
This is the original code without Spark:
import csv
import pandas as pd
from shapely.geometry import Point, Polygon

# Build the polygon from the loaded GeoJSON feature
polygon = Polygon(data['features'][cityIdx]['geometry']['coordinates'][0])

with open(outFileName, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    # Read the large CSV in chunks to keep memory usage bounded
    for chunk in pd.read_csv(inFileName, chunksize=chunksize, sep=','):
        ObjectIDs = list(chunk['ObjectID'])
        Lons = list(chunk['Lon'])
        Lats = list(chunk['Lat'])
        Speeds = list(chunk['Speed'])
        Directs = list(chunk['Direct'])
        Mileages = list(chunk['Mileage'])
        GPSTimes = list(chunk['GPSTime'])
        for ObjectID, Lon, Lat, Speed, Direct, Mileage, GPSTime in zip(ObjectIDs, Lons, Lats, Speeds, Directs, Mileages, GPSTimes):
            point = Point(Lon, Lat)
            # Keep only rows whose point falls inside the polygon
            if polygon.contains(point):
                writer.writerow([ObjectID, Lon, Lat, Speed, Direct, Mileage, GPSTime])
How can I do this with .rdd.map(fun).reduce(fun)? I thought about lambda expressions, but I couldn't formulate runnable Spark code.
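My best attempt so far is the minimal sketch below. Selecting rows maps more naturally onto filter than onto reduce, so I used filter instead; the polygon is built on the driver exactly as in the code above, the CSV is assumed to have a header row, and the paths "trajectories.csv" and "inside_polygon" are just placeholders for my real paths. Is this the right direction?

from pyspark import SparkContext
from shapely.geometry import Point, Polygon

sc = SparkContext(appName="PointInPolygon")

# Build the polygon on the driver (as in the original code) and broadcast it
# so every executor deserializes it once instead of once per record.
polygon = Polygon(data['features'][cityIdx]['geometry']['coordinates'][0])
poly_bc = sc.broadcast(polygon)

lines = sc.textFile("trajectories.csv")   # placeholder input path
header = lines.first()                    # header row to skip

def inside(line):
    # Parse one CSV line and test whether its (Lon, Lat) falls in the polygon.
    fields = line.split(',')
    lon, lat = float(fields[1]), float(fields[2])
    return poly_bc.value.contains(Point(lon, lat))

result = (lines
          .filter(lambda line: line != header)   # drop the header row
          .filter(inside))                       # keep rows inside the polygon

# Each partition is written as a separate part file; coalesce(1) would give a
# single file at the cost of funnelling everything through one task.
result.saveAsTextFile("inside_polygon")          # placeholder output path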