
I've imported a CSV file into Spark using pyspark.sql and registered it as a temp table with:

import pyspark
from pyspark.sql import HiveContext

sc = pyspark.SparkContext()
sqlCtx = HiveContext(sc)

# read the CSV with a header row and inferred column types
spark_df = sqlCtx.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load("./data/geo_file.csv")
spark_df.registerTempTable("geo_table")

In the table 'geo_table' there is a column called 'geo_location' that has values such as:

US>TX>618

US>NJ>241

US>NJ

My question is: how do I convert these text values into numeric values, either in SQL or in pyspark.sql?

In Pandas, I would do this:

df["geo_location_categories"] = df["geo_location"].astype('category')
df["geo_location_codes"] = df["geo_location_categories"].cat.codes 

1 Answer


From my point of view, there are several ways to solve your problem. If you just need to convert your "geo_location" column into a numeric column, you can use a UDF. You can define one in the following way (sorry, the code is in Scala):

val toInt = udf[Int, String](str => {
  // convert your geo location string into an integer using your existing business logic;
  // str.hashCode is only a stand-in so the snippet compiles
  str.hashCode
})

After that you can use this UDF in the following way:

val df = spark_df.withColumn("geo_location_codes", toInt(spark_df("geo_location")))
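
Since the question is about PySpark, a minimal sketch of the same idea in Python might look like the following. It builds an explicit value-to-code lookup (roughly what pandas category codes give you) and applies it with a UDF; the column name comes from your question, and the -1 fallback for unseen values is just an illustrative choice, not your business logic:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# collect the distinct geo_location values and assign each one an integer code
distinct_values = [row[0] for row in spark_df.select("geo_location").distinct().collect()]
codes = {value: index for index, value in enumerate(distinct_values)}

# look up the code for each row; -1 marks values not seen when the mapping was built
geo_to_code = udf(lambda geo: codes.get(geo, -1), IntegerType())

spark_df = spark_df.withColumn("geo_location_codes", geo_to_code(spark_df["geo_location"]))

If a fitted, reusable mapping is acceptable for your use case, pyspark.ml.feature.StringIndexer does essentially the same thing out of the box.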

Also, I think it should be possible to use a UserDefinedType as the column type. However, that depends on the version of Spark you are using. Take a look at this question for more info: How to define schema for custom type in Spark SQL?

I hope you will find this information useful.
