
I am new to Spark and Scala, so I am wondering how to do this. In Python pandas I would just call .apply() on the grouped column and it would work. I don't know how to do it in Spark using Scala.

I have a DataFrame of user names and the sites they have visited. After a groupBy on "user_name", I want to combine each user's sites into one long string.

val df = Seq(("user1", "facebook.com"), ("user1", "msn.com"), ("user1", "linkedin.com"),("user2","google.com"),("user2","apple.com")).toDF("user_name", "sites")

df.show
+---------+------------+
|user_name|       sites|
+---------+------------+
|    user1|facebook.com|
|    user1|     msn.com|
|    user1|linkedin.com|
|    user2|  google.com|
|    user2|   apple.com|
+---------+------------+

val grp = df.groupBy("user_name")

Now I want to apply this function to the grouped "sites" column:

var jn = (url: Array[String]) => url.sortWith(_ < _).mkString(":")

What I want:

+---------+---------------------------------+
|user_name|       sites                     |
+---------+---------------------------------+
|    user1|facebook.com:linkedin.com:msn.com|
|    user2|apple.com:google.com             |
+---------+---------------------------------+

How do I convert the GroupedData to a DataFrame in Spark?

How do I print the grouped DataFrame as-is right after the groupBy?

I have used a UDF to change a column in a Spark DataFrame, but I don't know how to use one on GroupedData. Is there a way to do that?

cryp
  • See also: http://stackoverflow.com/q/35258408/1560062 and http://stackoverflow.com/q/32902982/1560062 for explanation about `GroupedData` – zero323 May 21 '16 at 23:01
  • Thanks! But I am still unsure what to operate on. My UDF takes an array of strings, and I am guessing that after grouping by user_name I will get an array of sites to operate on. I did something like this: val jnn = udf(jn); df.groupBy("user_name").agg(jnn($"sites")) and got an error: **org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(site)' due to data type mismatch: argument 1 requires array type, however, 'site' is of string type** – cryp May 22 '16 at 01:14
  • You cannot use UDFs for aggregations. If you want to aggregate data you have to create a UDAF, which has a completely different API, shown in the other question. But in this case there is no need for that. Use `collect_list` and apply the UDF to the result. – zero323 May 22 '16 at 15:26
  • Great, thanks! That worked. I didn't know about collect_list. – cryp May 22 '16 at 22:13
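
For later readers, here is a minimal sketch of the approach suggested in the comments: collect each user's sites with `collect_list`, then turn the resulting array into one string. It assumes Spark 2.x or later with a local `SparkSession` (the question predates that API, so the session setup is an assumption), and the object and method names (`JoinSites`, `withBuiltins`, `withUdf`) are made up for illustration. Note that the UDF takes `Seq[String]` rather than `Array[String]`, because Spark hands array columns to Scala UDFs as a `Seq`. A `groupBy` on its own only returns a grouping object that cannot be shown; it becomes a DataFrame again once an aggregation such as `agg` is applied.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, sort_array, udf}

// Hypothetical helper object; the names are illustrative, not from the question.
object JoinSites {

  // Variant 1: built-in functions only, no UDF needed.
  // collect_list gathers the sites per user, sort_array sorts them ascending,
  // and concat_ws joins the array elements with ":".
  def withBuiltins(df: DataFrame): DataFrame =
    df.groupBy("user_name")
      .agg(concat_ws(":", sort_array(collect_list(col("sites")))).as("sites"))

  // Variant 2: apply a UDF to the collected array, as suggested in the comments.
  // The parameter is Seq[String] because array columns arrive as a Seq.
  private val joinSorted = udf((urls: Seq[String]) => urls.sorted.mkString(":"))

  def withUdf(df: DataFrame): DataFrame =
    df.groupBy("user_name")
      .agg(joinSorted(collect_list(col("sites"))).as("sites"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-sites-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("user1", "facebook.com"), ("user1", "msn.com"), ("user1", "linkedin.com"),
      ("user2", "google.com"), ("user2", "apple.com")
    ).toDF("user_name", "sites")

    // truncate = false so the joined strings are printed in full.
    withBuiltins(df).orderBy("user_name").show(false)
    withUdf(df).orderBy("user_name").show(false)

    spark.stop()
  }
}
```

Both variants should give one row per user with the sites sorted and joined by ":". The built-in variant is usually preferable where it fits, since Spark can optimize built-in functions but treats a UDF as a black box.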
