
I have a PySpark DataFrame with a range of numerical variables.

For example, my DataFrame has a column with values from 1 to 100, and each value should be mapped to a group label:

1-10 -> group1 (the column value for 1 to 10 should contain group1 as its value)
11-20 -> group2
...
91-100 -> group10

How can I achieve this with a PySpark DataFrame?

  • Hi, welcome to stackoverflow. Please give us a [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) and show us your desired output. – cronoik Apr 10 '19 at 12:31

1 Answer

# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
|  1| 54|
|  2|  7|
|  3| 72|
|  4| 99|
+---+---+

Once the DataFrame has been created, we use the floor() function to take the integral part of a number; for example, floor(15.5) is 15. Here we subtract 1 from Var before dividing by 10, so that boundary values such as 10 and 100 fall into the correct group, and then add 1 to the floored result because the group indexing starts at 1, as opposed to 0. Finally, we need to prepend the word group to the value. Concatenation can be achieved with the concat() function, but keep in mind that the prepended word group is not a column, so we must wrap it in lit(), which creates a column of a literal value.

# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
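# Subtract 1 before dividing so boundary values (10, 20, ..., 100) land in the lower group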
df = df.withColumn('Var',concat(lit('group'),(1+floor((col('Var')-1)/10))))
df.show()
+---+-------+
| ID|    Var|
+---+-------+
|  1| group6|
|  2| group1|
|  3| group8|
|  4|group10|
+---+-------+
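
As a quick sanity check of the boundaries (the two rows below are hypothetical), 10 should land in group1 and 100 in group10, which is exactly what the (Var - 1) shift guarantees:

# Hypothetical boundary rows to verify the edges of each range
edge = spark.createDataFrame([(5,10),(6,100)], ['ID','Var'])
edge.withColumn('Var',concat(lit('group'),(1+floor((col('Var')-1)/10)))).show()
+---+-------+
| ID|    Var|
+---+-------+
|  5| group1|
|  6|group10|
+---+-------+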
– cph_sto