0

I have a dataframe with date,accountid,score,genre and viewed columns, what I am trying to do is add a weight for each account based on a genre he watched per day if he watched a genre he get an arbitrary weight let say 0.20. If he doesn't watch the weight become 0.0, I need to create a new column for this calculation by grouping the accountid and genre per day, that way each account-genre combo get the weight for each day. I understand we don't have transpose function in Scala so that I can transpose the genre column and create weights for each account per day. I am thinking of creating a dictionary of weights for each account and append the weight for the day and apply to the new column. Any inputs or suggestions would be helpful.

  val df = sc.parallelize(Seq(("2018-01-01", 100.5,"id1","action",1),
  ("2018-01-02", 120.6,"id1","action",1),
  ("2018-01-03", 450.2,"id2","sports",1),
  ("2018-01-04", 200.7,"id1","action",1),
  ("2018-01-06", 121.4,"id3","live",1))).toDF("date", "score","accountid","genre","viewed")

output

+----------+-----+---------+------+------+
|      date|score|accountid| genre|viewed|
+----------+-----+---------+------+------+
|2018-01-01|100.5|      id1|action|     1|
|2018-01-02|120.6|      id1|action|     1|
|2018-01-03|450.2|      id2|sports|     1|
|2018-01-04|200.7|      id1|action|     1|
|2018-01-06|121.4|      id3|  live|     1|
+----------+-----+---------+------+------+

expected output

+----------+-----+---------+------+--------------------------+
|      date|score|accountid| genre|weights                   |
+----------+-----+---------+------+--------------------------+
|2018-01-01|100.5|      id1|action|     [0.20]               |
|2018-01-02|120.6|      id1|action|     [0.20,0.20]          |
|2018-01-03|450.2|      id2|sports|     [0,0,0.20]           |
|2018-01-04|200.7|      id1|action|     [0.20,0.20,0,0.20]   | 
|2018-01-06|121.4|      id3|  live|     [0,0,0,0,0.2]         |
+----------+-----+---------+------+--------------------------+

when the id1 has a genre recorded on firstrow then we concat the weight list with 0.20 for that day, for next day he has the same genre so we concat with 0.20, on day four id1 again shows up and weight is appended again. Weight list grows as the day goes by, if the id1 has no genre recorded then we append it to 0. Same behavior apply for every other accountid's.

Masterbuilder
  • 499
  • 2
  • 12
  • 24
  • where is the expected output? – Ramesh Maharjan May 12 '18 at 01:33
  • Ramesh, please see my edits for expected output – Masterbuilder May 12 '18 at 14:02
  • can you explain the output as well? its getting difficult to understand – Ramesh Maharjan May 12 '18 at 14:18
  • The list growing part is impossible to parallelize, as it makes each row dependent on the ones above them. Why do you want every row to record the whole history of weights for that account genre pair anyways? – Midiparse May 13 '18 at 19:45
  • My primary objective is to group "accountId" by index dates, that means I need to calculate the weights for all accountid's within a date range, irrespective of they have a genre associated with that date. The closest example I can think of [link](https://stackoverflow.com/questions/44978196/pandas-filling-missing-dates-and-values-within-group) – Masterbuilder May 13 '18 at 21:25

0 Answers0