
I have a PySpark DataFrame with a range of numerical variables.

For example, my DataFrame has a column with values from 1 to 100, and each value should be mapped to a group label:

1-10 -> group1 (the column value for 1 to 10 should contain group1 as its value)
11-20 -> group2
...
91-100 -> group10

How can I achieve this with a PySpark DataFrame?

  • Hi, welcome to stackoverflow. Please give us a [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) and show us your desired output. – cronoik Apr 10 '19 at 12:31

1 Answer

# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
|  1| 54|
|  2|  7|
|  3| 72|
|  4| 99|
+---+---+

Once the DataFrame has been created, we use the floor() function to take the integral part of a number; for example, floor(15.5) is 15. Here we subtract 1 from Var before dividing by 10, so that boundary values such as 10 and 100 fall into the correct group, and then add 1 to the floored result because the group indexing starts at 1, as opposed to 0. Finally, we need to prepend the word group to the value. Concatenation can be achieved with the concat() function, but keep in mind that the prepended word group is not a column, so we must wrap it in lit(), which creates a column of a literal value.

# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
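# Subtract 1 before dividing so boundary values (10, 20, ..., 100) land in the lower group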
df = df.withColumn('Var',concat(lit('group'),(1+floor((col('Var')-1)/10))))
df.show()
+---+-------+
| ID|    Var|
+---+-------+
|  1| group6|
|  2| group1|
|  3| group8|
|  4|group10|
+---+-------+
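
As a quick sanity check of the boundaries (the two rows below are hypothetical), 10 should land in group1 and 100 in group10, which is exactly what the (Var - 1) shift guarantees:

# Hypothetical boundary rows to verify the edges of each range
edge = spark.createDataFrame([(5,10),(6,100)], ['ID','Var'])
edge.withColumn('Var',concat(lit('group'),(1+floor((col('Var')-1)/10)))).show()
+---+-------+
| ID|    Var|
+---+-------+
|  5| group1|
|  6|group10|
+---+-------+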
– cph_sto