I've imported a CSV file into Spark using pyspark.sql and registered it as a temp table with:
import pyspark
from pyspark.sql import HiveContext

sc = pyspark.SparkContext()
sqlCtx = HiveContext(sc)
spark_df = sqlCtx.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load("./data/geo_file.csv")
spark_df.registerTempTable("geo_table")
In the table 'geo_table' there is a column called 'geo_location' that has values such as:
US>TX>618
US>NJ>241
US>NJ
My question is: how do I convert these text values into numeric codes, either in SQL or in pyspark.sql?
In pandas, I would do this:
df["geo_location_categories"] = df["geo_location"].astype('category')
df["geo_location_codes"] = df["geo_location_categories"].cat.codes
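For reference, here is the pandas version as a self-contained sketch on the three sample values from above (the DataFrame construction is illustrative, just to show the resulting codes):

```python
import pandas as pd

# Sample values from the 'geo_location' column above
df = pd.DataFrame({"geo_location": ["US>TX>618", "US>NJ>241", "US>NJ"]})

# Convert to the pandas categorical dtype, then take the integer codes
df["geo_location_categories"] = df["geo_location"].astype("category")
df["geo_location_codes"] = df["geo_location_categories"].cat.codes

# Codes follow the sorted order of the category labels:
# US>NJ -> 0, US>NJ>241 -> 1, US>TX>618 -> 2
print(list(df["geo_location_codes"]))  # [2, 1, 0]
```

Note that `.cat.codes` assigns codes by the lexicographic order of the distinct labels, not by order of appearance, which is why the first row gets code 2 here.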