Using DataFrame instead of Spark SQL for data analysis

Question

Below is the sample Spark SQL I wrote to get the count of males and females enrolled in an agency. I used SQL to generate the output. Is there a way to do the same thing using only the DataFrame API, without SQL?

val districtWiseGenderCountDF = hiveContext.sql("""
                                                   | SELECT District, 
                                                   |        count(CASE WHEN Gender='M' THEN 1 END) as male_count, 
                                                   |        count(CASE WHEN Gender='F' THEN 1 END) as FEMALE_count 
                                                   | FROM agency_enrollment 
                                                   | GROUP BY District
                                                   | ORDER BY male_count DESC, FEMALE_count DESC
                                                   | LIMIT 10""".stripMargin)

What version of Spark are you using? – James Tobin May 15 '17 at 17:09
I am using Spark 2 in the Hortonworks sandbox. – Deepak_Spark_Beginner May 15 '17 at 17:20

score 0 · Accepted Answer · edited May 23 '17 at 12:10

Starting with Spark 1.6, you can use groupBy + pivot to achieve what you'd like.

Without sample data (and without access to Spark > 1.5 myself), here is a solution that should work (not tested):

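// Group by district only; pivoting on gender then yields one count column per listed gender value.
val df = hiveContext.table("agency_enrollment")
df.groupBy("District").pivot("Gender", Seq("M", "F")).count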

See "How to pivot DataFrame?" for a generic example.
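As an alternative to pivot, the conditional counts from the question's query can also be expressed directly with `when` inside `count`. The following is only a minimal sketch, assuming Spark 2.x and a `SparkSession`; the sample rows are made up for illustration and stand in for the `agency_enrollment` table, which in practice would be loaded with `spark.table("agency_enrollment")`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, desc, when}

// Local session for illustration only; in the sandbox, reuse the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("gender-count-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the agency_enrollment table.
val enrollment = Seq(
  ("North", "M"), ("North", "F"), ("North", "M"),
  ("South", "F"), ("South", "F")
).toDF("District", "Gender")

// when(...) without otherwise(...) yields null for non-matching rows,
// and count skips nulls, mirroring count(CASE WHEN ... THEN 1 END).
val districtWiseGenderCountDF = enrollment
  .groupBy("District")
  .agg(
    count(when(col("Gender") === "M", 1)).as("male_count"),
    count(when(col("Gender") === "F", 1)).as("FEMALE_count")
  )
  .orderBy(desc("male_count"), desc("FEMALE_count"))
  .limit(10)

districtWiseGenderCountDF.show()
```

Unlike the pivot approach, this returns 0 rather than null for a district with no rows of a given gender, which matches the behaviour of the SQL version.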

answered May 15 '17 at 17:31

James Tobin
