effective way to groupby without using pivot in pyspark

Question

I have a query where I need to calculate memory utilization using pyspark. I had achieved this with python pandas using pivot but now I need to do it in pyspark and pivoting would be an expensive function so I would like to know if there is any alternative in pyspark for this solution

time_stamp          Hostname    kpi kpi_subtype value_current
2019/08/17 10:01:05 Server1     memory  Total       100
2019/08/17 10:01:06 Server1     memory  used        35
2019/08/17 10:01:09 Server1     memory  buffer      8
2019/08/17 10:02:04 Server1     memory  cached      10
2019/08/17 10:01:05 Server2     memory  Total       100
2019/08/17 10:01:06 Server2     memory  used        42
2019/08/17 10:01:09 Server2     memory  buffer      7
2019/08/17 10:02:04 Server2     memory  cached      9
2019/08/17 10:07:05 Server1     memory  Total       100
2019/08/17 10:07:06 Server1     memory  used        35
2019/08/17 10:07:09 Server1     memory  buffer      8
2019/08/17 10:07:04 Server1     memory  cached      10
2019/08/17 10:08:05 Server2     memory  Total       100
2019/08/17 10:08:06 Server2     memory  used        35
2019/08/17 10:08:09 Server2     memory  buffer      8
2019/08/17 10:08:04 Server2     memory  cached      10

Which need to be transformed to

time_stamp      Hostname    kpi Percentage
2019-08-17 10:05:00 Server1     memory  17
2019-08-17 10:05:00 Server2     memory  26
2019-08-17 10:10:00 Server1     memory  17
2019-08-17 10:10:00 Server2     memory  17

Python code i used

df3 = pd.read_csv('/home/yasin/Documents/IMI/Data/memorry sample.csv')
df3['time_stamp'] = pd.to_datetime(df3['time_stamp'])
ns5min=5*60*1000000000 
df3['time_stamp'] = pd.to_datetime(((df3['time_stamp'].astype(np.int64) // ns5min + 1 ) * ns5min))
df4 = df3.pivot_table('value_current' , ['time_stamp' , 'Hostname ' , 'kpi' ], 'kpi_subtype')
df4 = df4.reset_index()
df4['Percentage'] = ((df4['Total'] - (df4['Total'] - df4['used'] + df4['buffer'] + df4['cached'])) / df4['Total']) * 100

Looking for a to replicate this in pyspark and a more efficient way in python as pivot is an expensive operation and I need to perform this every 5 mins on a really large dataset

each time you will be having multiple record sets for server 1 & 2 ? — vikrant rana, Aug 18 '19 at 04:04
What is the timeframe in which first set of server1&server2 lies? Is it 5 mins? will it extend beyond 5 mins or it will remain in that timeframe — vikrant rana, Aug 18 '19 at 09:03
It will be the same set of servers every 5 mins. in real there will be about 20000 servers — yasin mohammed, Aug 19 '19 at 06:32

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

Pivoting is expensive when the list of values that are translated to columns is unknown. Spark has an overloaded pivot method that takes them as an argument.

def pivot(pivotColumn: String, values: Seq[Any])

In case they aren't known Spark must sort and collect the distinct values from your dataset. Otherwise, the logic is pretty straightforward and described here.

The implementation adds a new logical operator (o.a.s.sql.catalyst.plans.logical.Pivot). That logical operator is translated by a new analyzer rule (o.a.s.sql.catalyst.analysis.Analyzer.ResolvePivot) that currently translates it into an aggregation with lots of if statements, one expression per pivot value.

For example, df.groupBy("A", "B").pivot("C", Seq("small", "large")).sum("D") would be translated into the equivalent of df.groupBy("A", "B").agg(expr(“sum(if(C = ‘small’, D, null))”), expr(“sum(if(C = ‘large’, D, null))”)). You could have done this yourself but it would get long and possibly error prone quickly.

Without pivoting I would do something like that:

val in = spark.read.csv("input.csv")
      //cast to the unix timestamp
      .withColumn("timestamp", unix_timestamp($"time_stamp", "yyyy/MM/dd HH:mm:ss").cast(TimestampType))
      .drop($"time_stamp")

Now we can group our dataset by the time window with hostname and collect KPI metrics into a map.
There is an excellent answer describing just that.

val joinMap = udf { values: Seq[Map[String, Double]] => values.flatten.toMap }

val grouped = in.groupBy(window($"timestamp", "5 minutes"), $"Hostname")
  .agg(joinMap(collect_list(map($"kpi_subtype", $"value_current".cast(DoubleType)))).as("metrics"))

Output

+------------------------------------------+--------+-------------------------------------------------------------+
|window                                    |Hostname|metrics                                                      |
+------------------------------------------+--------+-------------------------------------------------------------+
|[2019-08-17 10:00:00, 2019-08-17 10:05:00]|Server1 |[Total -> 100.0, used -> 35.0, buffer -> 8.0, cached -> 10.0]|
|[2019-08-17 10:00:00, 2019-08-17 10:05:00]|Server2 |[Total -> 100.0, used -> 42.0, buffer -> 7.0, cached -> 9.0] |
|[2019-08-17 10:05:00, 2019-08-17 10:10:00]|Server1 |[Total -> 100.0, used -> 35.0, buffer -> 8.0, cached -> 10.0]|
|[2019-08-17 10:05:00, 2019-08-17 10:10:00]|Server2 |[Total -> 100.0, used -> 35.0, buffer -> 8.0, cached -> 10.0]|
+------------------------------------------+--------+-------------------------------------------------------------+

Now we define some aliases and a simple select statement:

val total = col("metrics")("Total")
val used = col("metrics")("used")
val buffer = col("metrics")("buffer")
val cached = col("metrics")("cached")

val result = grouped.select($"window", $"Hostname",
          (total - ((total - used + buffer + cached) / total) * 100).as("percentage"))

And here we go:

+------------------------------------------+--------+----------+
|window                                    |Hostname|percentage|
+------------------------------------------+--------+----------+
|[2019-08-17 10:00:00, 2019-08-17 10:05:00]|Server1 |17.0      |
|[2019-08-17 10:00:00, 2019-08-17 10:05:00]|Server2 |26.0      |
|[2019-08-17 10:05:00, 2019-08-17 10:10:00]|Server1 |17.0      |
|[2019-08-17 10:05:00, 2019-08-17 10:10:00]|Server2 |17.0      |
+------------------------------------------+--------+----------+

Hi... I believe this code is in scala. Since the rest of my code is in pyspark need to integrate this with my existing code. — yasin mohammed, Aug 19 '19 at 06:39
All of the used functions are Spark ones, don't you have them in python? — Gelerion, Aug 19 '19 at 09:54
joinMap = udf ({ values: Seq[Map[String, Double]] => values.flatten.toMap} ). I modified this to python but I am getting a sytax error for the symbol '=>' — yasin mohammed, Aug 19 '19 at 14:22
Use `Python` UDF syntax instead: https://docs.databricks.com/spark/latest/spark-sql/udf-python.html. Also, your types are different, here you have an example for python. https://stackoverflow.com/questions/41288622/pyspark-create-maptype-column-from-existing-columns — Gelerion, Aug 19 '19 at 14:46

score 1 · Answer 2 · answered Aug 20 '19 at 13:23

1st is using pivot in spark and the 2nd is using map.

1st Solution

df = sql.read.csv("/home/yasin/Documents/IMI/Data/memorry sample.csv", header = "True").withColumn("timestamp", unix_timestamp("time_stamp", "yyyy/MM/dd HH:mm:ss").cast(TimestampType())).drop("time_stamp")
df = df.withColumn("unixtime",unix_timestamp(df["timestamp"],"yyyy/MM/dd HH:mm:ss"))
df = df.withColumn("unixtime2",(round(df["unixtime"]/300)*300).cast("timestamp"))
df = df.groupBy("unixtime2" , "Hostname" , "kpi").pivot("kpi_subtype").agg(mean(df["value_current"]))
df = df.withColumn("Percentage", (df["Total"] - (df["Total"] - df["Used"] + df["buffer"] + df["cached"])) /df["Total"] * 100)

2nd Solution

df = sql.read.csv("/home/yasin/Documents/IMI/Data/memorry sample.csv", header = "True").withColumn("timestamp", unix_timestamp("time_stamp", "yyyy/MM/dd HH:mm:ss").cast(TimestampType())).drop("time_stamp")
df = df.withColumn("unixtime",unix_timestamp(df["timestamp"],"yyyy/MM/dd HH:mm:ss"))
df = df.withColumn("unixtime2",(round(df["unixtime"]/300)*300).cast("timestamp"))
df = df.withColumn("value_current2",df["value_current"].cast("Float"))
df = df.groupBy("unixtime2" , "Hostname" , "kpi").agg(collect_list(create_map("kpi_subtype","value_current2")).alias("mapped"))
nn=df.withColumn("formula" ,  ( df["mapped"][0]["Total"].cast("Float") - (( df["mapped"][0]["Total"].cast("Float") - df["mapped"][1]["used"].cast("Float")  + df["mapped"][2]["buffer"].cast("Float") + df["mapped"][3]["cached"].cast("Float") ) / df["mapped"][0]["Total"].cast("Float") ) * 100).cast("Float"))

nice. looking some way to do it in Hive – vikrant rana Aug 20 '19 at 17:55 — vikrant rana, Aug 20 '19 at 17:55
I have not used hive so not aware about it. – yasin mohammed Aug 21 '19 at 04:23 — yasin mohammed, Aug 21 '19 at 04:23

effective way to groupby without using pivot in pyspark

2 Answers2

Linked