I am new to Spark (2.x.x) and spark-sql. I created a DataFrame using the Spark SQL context:
dff = (sqlCtx.read.format("com.databricks.spark.csv")
             .option("header", "true").option("inferSchema", "true").option("delimiter", "\t")
             .load("/home/hduser/Desktop/allFromDesktop/pyspark/creditData.csv"))
dff.show()
+---+------------------+-----+------+-----+---+---------+------+-------+-------+----------+-------+---------+
|   |            Income|Limit|Rating|Cards|Age|Education|Gender|Student|Married| Ethnicity|Balance|Age_class|
+---+------------------+-----+------+-----+---+---------+------+-------+-------+----------+-------+---------+
|  0|14.890999999999998| 3606|   283|    2| 34|       11|  Male|     No|    Yes| Caucasian|    333|    25-34|
|  1|           106.025| 6645|   483|    3| 82|       15|Female|    Yes|    Yes|     Asian|    903|      65+|
|  2|104.59299999999999| 7075|   514|    4| 71|       11|  Male|     No|     No|     Asian|    580|      65+|
+---+------------------+-----+------+-----+---+---------+------+-------+-------+----------+-------+---------+
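(As an aside, I believe Spark 2.x also has a CSV reader built in, so the same load should work without the spark-csv package; this is an untested sketch:)

dff = (sqlCtx.read.option("header", "true").option("inferSchema", "true")
             .option("delimiter", "\t")
             .csv("/home/hduser/Desktop/allFromDesktop/pyspark/creditData.csv"))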
I tried unit testing the code:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

tab = (dff.select(['Age_class', 'Balance', 'Limit']).groupby('Age_class')
          .agg(F.count('Limit'), F.mean('Limit').alias('Limit_avg'),
               F.min('Limit').alias('Limit_min'), F.max('Limit').alias('Limit_max'))
          .withColumn('total', F.sum(col('Limit')).over(Window.partitionBy()))
          .withColumn('Percent', col('Limit') * 100 / col('total'))
          .drop(col('total')))
tab.show()
I found that as soon as the agg()/count() call executes, the original columns get replaced by the new .alias() column names.

Unit test #1 (the first part of the Python code executed successfully):
tab = (dff.select(['Age_class', 'Balance', 'Limit']).groupby('Age_class')
          .agg(F.count('Limit'),
               F.mean('Limit').alias('Limit_avg'),
               F.min('Limit').alias('Limit_min'),
               F.max('Limit').alias('Limit_max')))
tab.show()
With output:
+---------+------------+------------------+---------+---------+
|Age_class|count(Limit)|         Limit_avg|Limit_min|Limit_max|
+---------+------------+------------------+---------+---------+
|    45-54|          65| 4836.630769230769|      855|    11200|
|      <25|          11|3932.6363636363635|     2120|     6375|
|    55-64|          68|            4530.0|     1311|    11966|
Here you can see I lost my original 'Limit' column.
The 'Limit' column of the original DataFrame got deleted (why??) and replaced by the new aggregate columns, i.e. I lose the original column. So when the second part of the code runs, it cannot find the original column in the DataFrame, and the rest of the statement, i.e.

.withColumn('total', F.sum(col('Limit')).over(Window.partitionBy())).withColumn('Percent', col('Limit') * 100 / col('total')).drop(col('total'))

shows the error:
Py4JJavaError: An error occurred while calling o2104.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`Limit`' given input columns: [Age_class, Limit_max, Limit_min, Limit_avg, count(Limit)];;
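To illustrate what I think is happening, here is a minimal sketch on a toy DataFrame (the values are made up):

from pyspark.sql import functions as F

# Toy DataFrame with the same two columns (hypothetical values).
toy = sqlCtx.createDataFrame([('25-34', 3606), ('25-34', 4000), ('65+', 6645)],
                             ['Age_class', 'Limit'])
agged = toy.groupby('Age_class').agg(F.count('Limit'))

# After groupBy().agg(), only the grouping key(s) and the aggregates remain:
print(agged.columns)  # ['Age_class', 'count(Limit)'] -- no plain 'Limit' column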
Now, when I changed my groupby clause to .groupby('Age_class', 'Limit') instead of .groupby('Age_class'), my code runs without the error.

QUESTION 1: Why do I need to add the "Limit" column to the groupby() clause when I have already mentioned it in the select statement?

QUESTION 2: After execution, the "Age_class" column did not get converted to groups (bins) even though I am using groupby. I was expecting something like the table below (a sketch of the pipeline I had in mind follows the table):
+---------+------------+------------------+---------+---------+----------+
|Age_class|count(Limit)|         Limit_avg|Limit_min|Limit_max|Percentage|
+---------+------------+------------------+---------+---------+----------+
|    45-54|         120|3183.0666666666666|      338|    12612|      12.0|
|      <25|         150| 2970.733333333333|      276|    15672|      15.0|
|    55-64|          56| 3493.660714285714|      385|    15945|       5.6|
|    35-44|         254| 3403.771653543307|      250|    15857|      25.4|
|    25-34|         397| 3298.823677581864|      343|    18424|      39.7|
|      65+|          23|3210.1739130434785|      571|    14896|       2.3|
+---------+------------+------------------+---------+---------+----------+
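For reference, this is the pipeline I imagined would produce the table above. It is an untested sketch: the alias Limit_count is my own invention, and I am assuming an empty Window.partitionBy() gives a window over the whole result:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

tab = (dff.select(['Age_class', 'Balance', 'Limit']).groupby('Age_class')
          .agg(F.count('Limit').alias('Limit_count'),  # hypothetical alias
               F.mean('Limit').alias('Limit_avg'),
               F.min('Limit').alias('Limit_min'),
               F.max('Limit').alias('Limit_max'))
          # An empty partitionBy() makes the window span the whole result set:
          .withColumn('total', F.sum('Limit_count').over(Window.partitionBy()))
          .withColumn('Percent', F.col('Limit_count') * 100 / F.col('total'))
          .drop('total'))
tab.show()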
tab = (dff.select(['Age_class', 'Balance', 'Limit']).groupby('Age_class', 'Limit')
          .agg(F.count('Limit'), F.mean('Limit').alias('Limit_avg'),
               F.min('Limit').alias('Limit_min'), F.max('Limit').alias('Limit_max'))
          .withColumn('total', F.sum(col('Limit')).over(Window.partitionBy()))
          .withColumn('Percent', col('Limit') * 100 / col('total'))
          .drop(col('total')))
tab.show()
Actual output ('Age_class' not converted into groups, a.k.a. bins):
+---------+-----+------------+---------+---------+---------+-------------------+
|Age_class|Limit|count(Limit)|Limit_avg|Limit_min|Limit_max| Percent|
+---------+-----+------------+---------+---------+---------+-------------------+
| 45-54| 7838| 1| 7838.0| 7838| 7838| 0.4137807247233719|
| 35-44| 886| 1| 886.0| 886| 886|0.04677337612974069|
| 45-54| 4632| 1| 4632.0| 4632| 4632| 0.244530788073317|
| 55-64| 1448| 1| 1448.0| 1448| 1448|0.07644226708336853|
| 55-64| 5107| 1| 5107.0| 5107| 5107| 0.2696068080074331|
| 45-54| 2586| 1| 2586.0| 2586| 2586| 0.1365191316834192|
| 35-44| 4159| 1| 4159.0| 4159| 4159| 0.2195603513810288|
| 45-54| 4943| 1| 4943.0| 4943| 4943| 0.2609489821775488|
| 45-54| 2558| 1| 2558.0| 2558| 2558|0.13504096629782922|
| 25-34| 3969| 1| 3969.0| 3969| 3969|0.20952994340738237|
| 35-44| 5319| 1| 5319.0| 5319| 5319| 0.2807986316411859|
| 45-54| 8100| 1| 8100.0| 8100| 8100| 0.4276121294028212|
| 45-54| 6040| 1| 6040.0| 6040| 6040| 0.3188613903201284|
| 45-54| 4673| 1| 4673.0| 4673| 4673|0.24669524453078806|
| 65+| 2330| 1| 2330.0| 2330| 2330| 0.1230044767294535|
| 45-54| 6922| 1| 6922.0| 6922| 6922| 0.3654235999662134|
| 65+| 4263| 1| 4263.0| 4263| 4263|0.22505067995607736|
| 25-34| 4391| 1| 4391.0| 4391| 4391|0.23180800743306024|
| 65+| 7499| 1| 7499.0| 7499| 7499|0.39588436523355014|
| 45-54| 8732| 1| 8732.0| 8732| 8732| 0.460976433820424|
+---------+-----+------------+---------+---------+---------+-------------------+
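My current guess at why every count is 1 (a toy sketch with made-up values): grouping by both columns makes every distinct (Age_class, Limit) pair its own group, and since Limit is essentially unique per row, each group contains exactly one row.

from pyspark.sql import functions as F

# Hypothetical rows: Limit is (almost) unique, so every (Age_class, Limit)
# pair forms its own one-row group.
toy = sqlCtx.createDataFrame([('45-54', 7838), ('45-54', 4632), ('35-44', 886)],
                             ['Age_class', 'Limit'])
toy.groupby('Age_class', 'Limit').agg(F.count('Limit')).show()
# Every group has count(Limit) == 1, exactly like the output above.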