
I have a CSV document containing latitude and longitude columns that I'm loading into a SQLContext.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .schema(customSchema)
  .load(inputFile)

CSV example

metro_code, resolved_lat, resolved_lon
602, 40.7201, -73.2001
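
For reference, the `customSchema` passed to `.schema(...)` above is not shown in the question; a minimal sketch matching the three sample columns might look like the following (the column types and nullability here are assumptions):

import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

// Hypothetical schema matching the CSV sample above;
// adjust types and nullability to the real data
val customSchema = StructType(Seq(
  StructField("metro_code", IntegerType, nullable = true),
  StructField("resolved_lat", DoubleType, nullable = true),
  StructField("resolved_lon", DoubleType, nullable = true)
))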

I'm trying to figure out the best way to add a new column that computes the GeoHex for each row. Hashing the lat and lon is easy with the geohex package. I think I either need the parallelize method, or I've seen some examples passing a function to withColumn.

• Possible duplicate of [Updating a dataframe column in spark](http://stackoverflow.com/questions/29109916/updating-a-dataframe-column-in-spark) – jopasserat Jan 13 '16 at 18:43

1 Answer


Wrapping the required function in a UDF should do the trick:

import org.apache.spark.sql.functions.udf
import org.geohex.geohex4j.GeoHex
import sqlContext.implicits._  // for toDF and the $"..." syntax (sqlContext as in the question)

val df = sc.parallelize(Seq(
  (Some(602), 40.7201, -73.2001), (None, 5.7805, 139.5703)
)).toDF("metro_code", "resolved_lat", "resolved_lon")

// geoEncode(level) returns a UDF that encodes a (lat, lon) pair
// as a GeoHex code at the given level
def geoEncode(level: Int) = udf(
  (lat: Double, long: Double) => GeoHex.encode(lat, long, level))

df.withColumn("code", geoEncode(9)($"resolved_lat", $"resolved_lon")).show
// +----------+------------+------------+-----------+
// |metro_code|resolved_lat|resolved_lon|       code|
// +----------+------------+------------+-----------+
// |       602|     40.7201|    -73.2001|PF384076026|
// |      null|      5.7805|    139.5703|PR081331784|
// +----------+------------+------------+-----------+
• I can't get past this error `value $ is not a member of StringContext` and Google searches don't come back with anything. I'll have to find the Scala docs on `$`. – jspooner Jan 22 '16 at 01:37
• This worked for me: `df.withColumn("gh11", geoEncode(11)(df("resolved_lat"), df("resolved_lon"))).show` – jspooner Jan 22 '16 at 02:43
• @jspooner @zero323 I believe the dollar syntax needs `import sqlContext.implicits.StringToColumn`, and there is no `sqlContext` in your snippet to import it from. – Iain Mar 21 '16 at 09:46
• You need `sqlContext` for `toDF`, so it is implicitly there :) As for `StringToColumn` and the `$` method, I assume this is simply a convention, like `sc` for `SparkContext`. – zero323 Mar 21 '16 at 14:03
  • @zero323 Thanks a tonne!! – Anbarasu Dec 13 '16 at 20:48
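
To consolidate the comment thread: the `$"colname"` syntax comes from the `StringToColumn` implicit in `sqlContext.implicits`, so it only compiles after that import; referencing columns through the DataFrame itself needs no import at all. A short sketch, reusing the `df` and `geoEncode` defined in the answer:

// With the implicits import in scope, $"..." works:
import sqlContext.implicits._
df.withColumn("code", geoEncode(9)($"resolved_lat", $"resolved_lon")).show

// Without it, column references via the DataFrame work just as well,
// as jspooner noted above:
df.withColumn("code", geoEncode(9)(df("resolved_lat"), df("resolved_lon"))).show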