
I have one table with many different types of events, and some of the rows carry a piece of information that is essential for analysing the rest of the data. This is the table I have:

    name   |player_id|data_ms|coins|progress |
 progress  |  1223   | 10    |     |     128 |
 complete  |  1223   | 11    |  154|         |
 win       |  1223   | 9     |  111|         |
 progress  |  1223   | 11    |     |     129 |
 played    |  1111   | 19    |  141|         |
 progress  |  1111   | 25    |     |     225 |

This is the table that I want:

    name    |player_id|data_ms|coins|progress |
 progress   |  1223   | 10    |     |     128 |
 complete   |  1223   | 11    |  154|     128 |
 win        |  1223   | 9     |  111|     129 |
 progress   |  1223   | 11    |     |     129 |
 played     |  1111   | 19    |  141|     225 |
 progress   |  1111   | 25    |     |     225 |

I need to find the progress of each player, with the condition that it has to be the first progress event emitted after the data_ms (epoch Unix timestamp) of the event in question.

My table has 4 billion rows of data and it's partitioned by date.

I tried to create a UDF that reads the table and filters it, but that's not an option, since you can't serialize the Spark session into a UDF.

Any idea how I should do this?

1 Answer


It seems like you want to fill gaps in the progress column. I didn't fully understand the condition, but if it's based on data_ms, then your Spark SQL query should look like this:

dataFrame.createOrReplaceTempView("your_table")

val progressDf = sparkSession.sql(
  """
    SELECT name, player_id, data_ms, coins,
        -- fill NULL progress with the last non-null value seen so far for this player;
        -- LAST_VALUE(progress, TRUE) skips nulls within the frame
        COALESCE(progress, LAST_VALUE(progress, TRUE) OVER (PARTITION BY player_id ORDER BY data_ms ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS progress
    FROM your_table
  """
)
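To make the window's semantics concrete, here's a minimal sketch in plain Python (no Spark) of what that `LAST_VALUE(progress, TRUE)` frame computes on the sample rows: partition by `player_id`, order by `data_ms`, and carry the last non-null `progress` forward. The helper name `forward_fill` and the tuple layout are my own, just for illustration:

```python
# Plain-Python sketch of the window above: per player, order by data_ms and
# carry the last non-null progress forward (nulls are modelled as None).
from itertools import groupby

rows = [
    # (name, player_id, data_ms, coins, progress)
    ("progress", 1223, 10, None, 128),
    ("complete", 1223, 11, 154, None),
    ("win",      1223,  9, 111, None),
    ("progress", 1223, 11, None, 129),
    ("played",   1111, 19, 141, None),
    ("progress", 1111, 25, None, 225),
]

def forward_fill(rows):
    out = []
    # PARTITION BY player_id ORDER BY data_ms
    keyed = sorted(rows, key=lambda r: (r[1], r[2]))
    for _, group in groupby(keyed, key=lambda r: r[1]):
        last = None  # last non-null progress seen in this partition
        for name, pid, ms, coins, prog in group:
            if prog is not None:
                last = prog
            out.append((name, pid, ms, coins, prog if prog is not None else last))
    return out

filled = forward_fill(rows)
```

Note that this fills from *preceding* rows, so an event with no earlier progress (like the win at data_ms 9) stays null. If the requirement really is "the first progress emitted *after* this event", flip the frame to `ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING` and use `FIRST_VALUE(progress, TRUE)` instead.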
Yuriy Bondaruk