
I have a very large CSV file, so I used Spark to load it into a Spark DataFrame.
I need to extract the latitude and longitude from each row of the CSV in order to create a folium map.
With pandas I can solve my problem with a loop:

for index, row in locations.iterrows():
    folium.CircleMarker(location=(row["Pickup_latitude"],
                                  row["Pickup_longitude"]),
                        radius=20,
                        color="#0A8A9F",
                        fill=True).add_to(marker_cluster)

I found that, unlike a pandas DataFrame, a Spark DataFrame can't be processed with a loop (see how to loop through each row of dataFrame in pyspark).

So I thought I could re-engineer the problem and cut the big data into Hive tables, then iterate over them.

Is it possible to cut the huge Spark DataFrame into Hive tables and then iterate over the rows with a loop?

A.HADDAD
  • Please use [these](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) guidelines to improve your question. – Vladislav Varslavans May 30 '18 at 12:41

1 Answer

Generally you don't need to iterate over a DataFrame or RDD. You only create transformations (like map) that will be applied to each record, and then call some action to trigger that processing.

You need something like:

(dataframe
     .withColumn("latitude", <how to extract latitude>)
     .withColumn("longitude", <how to extract longitude>)
     .select("latitude", "longitude")
     .rdd
     .map(lambda row: <extract values from Row type>)
     .collect())        # this will move the data to a local collection
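
For the question's DataFrame the coordinate columns already exist, so as a minimal sketch (assuming the CSV was loaded into a variable named `dataframe` and the columns are called Pickup_latitude / Pickup_longitude, as in the pandas loop above) you could collect the pairs and build the map locally on the driver:

import folium
from folium.plugins import MarkerCluster

# Collect only the two coordinate columns to the driver as (lat, lon) pairs.
coords = (dataframe
          .select("Pickup_latitude", "Pickup_longitude")
          .rdd
          .map(lambda row: (row["Pickup_latitude"], row["Pickup_longitude"]))
          .collect())

# Back on the driver, the original pandas-style loop works unchanged,
# because the folium objects never leave the local machine.
m = folium.Map()
marker_cluster = MarkerCluster().add_to(m)
for lat, lon in coords:
    folium.CircleMarker(location=(lat, lon),
                        radius=20,
                        color="#0A8A9F",
                        fill=True).add_to(marker_cluster)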

In case you can't do it with SQL, you need to do it using the RDD API:

(dataframe
     .rdd
     .map(lambda row: <create new row with latitude and longitude>)
     .collect())
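
If the collected list is too large to fit in the driver's memory, one alternative (my suggestion, not part of the approach above) is DataFrame.toLocalIterator(), which streams rows to the driver one partition at a time; the folium objects are then only created locally and never need to be serialized:

# Sketch: stream rows to the driver one partition at a time instead of
# collecting everything at once. toLocalIterator() is a standard PySpark
# method, but applying it to this problem is my suggestion.
rows = dataframe.select("Pickup_latitude", "Pickup_longitude").toLocalIterator()
for row in rows:
    folium.CircleMarker(location=(row["Pickup_latitude"],
                                  row["Pickup_longitude"]),
                        radius=20,
                        color="#0A8A9F",
                        fill=True).add_to(marker_cluster)

Either way, the key point is the same: Spark only hands the raw coordinates to the driver, and folium runs entirely outside Spark.
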
Vladislav Varslavans
  • No, you didn't understand me. I don't want to add columns to the DataFrame. What I want is to save the latitude and longitude of every row into a variable, so I can visualize it within a map later – A.HADDAD May 30 '18 at 22:24
  • I found the action foreach, which loops over the RDD in Spark. Unfortunately, the folium objects aren't serializable, so my problem isn't solved – A.HADDAD Jun 08 '18 at 13:14