
I have a data frame and I want to roll the data up into 7-day windows and perform some aggregations on a few of the columns.

I have a PySpark SQL dataframe like this:

|Sale_Date |P_1|P_2|P_3|G_1|G_2|G_3|Total_Sale|Sale_Amt|Promo_Disc_Amt|
|2013-04-10|  1|  9|  1|  1|  1|  1|         1|   295.0|           0.0|
|2013-04-11|  1|  9|  1|  1|  1|  1|         3|   567.0|           0.0|
|2013-04-12|  1|  9|  1|  1|  1|  1|         2|   500.0|         200.0|
|2013-04-13|  1|  9|  1|  1|  1|  1|         1|   245.0|          20.0|
|2013-04-14|  1|  9|  1|  1|  1|  1|         1|   245.0|           0.0|
|2013-04-15|  1|  9|  1|  1|  1|  1|         2|   500.0|         200.0|
|2013-04-16|  1|  9|  1|  1|  1|  1|         1|   250.0|           0.0|

I have applied a window specification over the data frame as follows:

from pyspark.sql.window import Window

days = lambda i: i * 86400
windowSp = Window().partitionBy(dataframeOfquery3["P_1"], dataframeOfquery3["P_2"], dataframeOfquery3["P_3"], dataframeOfquery3["G_1"], dataframeOfquery3["G_2"], dataframeOfquery3["G_3"])\
          .orderBy(dataframeOfquery3["Sale_Date"].cast("timestamp").cast("long").desc())\
          .rangeBetween(-days(7), 0)

Now I want to perform some aggregations, i.e. apply some window functions like the following:

from pyspark.sql.functions import min

df = dataframeOfquery3.select(min(dataframeOfquery3["Sale_Date"]).over(windowSp).alias("Sale_Date"))
df.show()

But it gives the following error.

py4j.protocol.Py4JJavaError: An error occurred while calling o138.select.
: org.apache.spark.sql.AnalysisException: Could not resolve window function 'min'. Note that, using window functions currently requires a HiveContext;

I am using Apache Spark 1.6.0, pre-built for Hadoop.


1 Answer

The error kind of says everything:

py4j.protocol.Py4JJavaError: An error occurred while calling o138.select.
: org.apache.spark.sql.AnalysisException: Could not resolve window function 'min'. Note that, using window functions currently requires a HiveContext;

You'll need a version of Spark that supports Hive (built with Hive support); then you can declare a HiveContext:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

and then use that context to perform your window function.

In Python:

# sc is an existing SparkContext.
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
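
Once a HiveContext is in place, the window aggregation from the question should resolve. Here is a minimal sketch (not part of the original answer), assuming `sc` already exists and `dataframeOfquery3` has the columns shown in the question; the aliases `Window_Start` and `Sale_Amt_7d` are just illustrative names:

# Minimal sketch: a trailing 7-day window aggregation under a HiveContext.
# Assumes `sc` is an existing SparkContext and `dataframeOfquery3` has the
# columns from the question (Sale_Date, P_1..P_3, G_1..G_3, Sale_Amt, ...).
from pyspark.sql import HiveContext
from pyspark.sql.window import Window
from pyspark.sql.functions import col, min, sum

sqlContext = HiveContext(sc)

days = lambda i: i * 86400  # rangeBetween operates on the ordering value (seconds here)

windowSp = Window.partitionBy("P_1", "P_2", "P_3", "G_1", "G_2", "G_3") \
    .orderBy(col("Sale_Date").cast("timestamp").cast("long")) \
    .rangeBetween(-days(7), 0)

result = dataframeOfquery3.select(
    "Sale_Date",
    min(col("Sale_Date")).over(windowSp).alias("Window_Start"),
    sum("Sale_Amt").over(windowSp).alias("Sale_Amt_7d"),
)
result.show()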

You can read further about the difference between SQLContext and HiveContext here.

SparkSQL has a SQLContext and a HiveContext. HiveContext is a superset of the SQLContext. The Spark community suggests using the HiveContext. You can see that when you run spark-shell, which is your interactive driver application, it automatically creates a SparkContext defined as sc and a HiveContext defined as sqlContext. The HiveContext allows you to execute SQL queries as well as Hive commands. The same behavior occurs for pyspark.
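
As a quick illustration (not from the original answer), in an interactive pyspark shell started against a Hive-enabled build you can check what kind of context was created for you:

# In the pyspark shell, `sc` and `sqlContext` are created automatically.
from pyspark.sql import HiveContext

print(type(sqlContext))                     # on a Hive build: pyspark.sql.context.HiveContext
print(isinstance(sqlContext, HiveContext))  # True means window functions will resolve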

  • Yes, I have seen the error, but I have followed these threads: [thread 1](http://stackoverflow.com/questions/32769328/how-to-use-window-functions-in-pyspark-using-dataframes), [thread 2](http://stackoverflow.com/questions/33207164/spark-window-functions-rangebetween-dates) and the [Databricks post](https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html). In all of them, window functions work properly with the PySpark SQLContext. @eliasah – Sayak Ghosh Mar 15 '16 at 13:00
  • It's kind of tricky in some environments. I know all of those threads; they don't present the HiveContext, but it's actually needed, and they don't talk about cluster configuration either. I have shown you the way I do it. – eliasah Mar 15 '16 at 13:03
  • Is there any way to use pyspark.sql.window with the PySpark SQLContext, without a HiveContext? Or how can I handle this kind of situation with pyspark.sql.SQLContext? Please suggest. @eliasah – Sayak Ghosh Mar 15 '16 at 13:04
  • @SayakGhosh It works because by default we use an instance of HiveContext as a SQLContext. And no, you cannot use window functions outside HiveContext. Why do you want to avoid it? In PySpark it doesn't add to your dependencies. – zero323 Mar 15 '16 at 13:04
  • You can't handle that situation with a regular SQLContext alone; a HiveContext is necessary, unfortunately. – eliasah Mar 15 '16 at 13:05
  • Yes, I was pretty confused after following those. As I am new to this environment, there is something I would like to know: is there any major difference between SQLContext and HiveContext when reading a text file or creating a DataFrame? @eliasah – Sayak Ghosh Mar 15 '16 at 13:09
  • I have edited my answer. I hope it answers your question. – eliasah Mar 15 '16 at 13:14
  • Actually I am not avoiding HiveContext, but I started building my application on a regular SQLContext without knowing this difference. That's why. @zero323 – Sayak Ghosh Mar 15 '16 at 13:15
  • If you use a regular SQLContext, you can't use Hive features such as window functions. – eliasah Mar 15 '16 at 13:16
  • Thanks a lot @eliasah for your guidance. Would you please point me to the repository from which I can get Apache Spark 1.6.0 pre-built with Hive? – Sayak Ghosh Mar 15 '16 at 13:18
  • You'll have to build it from source. http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support – eliasah Mar 15 '16 at 13:20
  • Default pre-built binaries from downloads work as well. – zero323 Mar 15 '16 at 13:22
  • Thanks @eliasah. If I run into issues while installing, I will let you know. Please help when needed. – Sayak Ghosh Mar 15 '16 at 13:26
  • @zero323 Thanks for your suggestion too. It would be very helpful if you could suggest some links where I can get the default pre-built binaries of the latest Spark. – Sayak Ghosh Mar 15 '16 at 13:29
  • I want to share something. I have downloaded the Spark distribution from [here](http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0.tgz) @eliasah. Is it the correct distribution for using HiveContext? – Sayak Ghosh Mar 15 '16 at 13:34
  • Normally, yes it is! But in case the error persists, you'll have to download the source and compile Spark; you can find the details in the Spark documentation. – eliasah Mar 15 '16 at 13:35
  • After unpacking I used `sbt/sbt assembly` to build Spark. Is that the right command, or do I have to use `sbt/sbt -Phive assembly` so that I can use HiveContext as the SQL context? @eliasah please suggest – Sayak Ghosh Mar 16 '16 at 11:58
  • Here you go: http://spark.apache.org/docs/latest/building-spark.html#building-with-sbt PS: the Spark documentation is excellent; you can always find what you are looking for there. Consider it the bible for Spark! – eliasah Mar 16 '16 at 12:27
  • Thanks a lot. Now HiveContext is working properly for me. @eliasah – Sayak Ghosh Mar 17 '16 at 06:57
  • @eliasah I have gone through the [thread](http://stackoverflow.com/questions/33207164/spark-window-functions-rangebetween-dates). I have a similar situation too, but my Sale_Date is not rolling up. Please suggest. – Sayak Ghosh Mar 17 '16 at 11:21
  • I don't understand your question @SayakGhosh – eliasah Mar 17 '16 at 11:36
  • Don't do that! Please ask a new question! @SayakGhosh – eliasah Mar 17 '16 at 12:04