I'm trying to port some code from R to Scala to perform customer analysis. I have already computed the Recency, Frequency and Monetary factors with Spark into a DataFrame.
Here is the schema of the DataFrame:
df.printSchema
root
|-- customerId: integer (nullable = false)
|-- recency: long (nullable = false)
|-- frequency: long (nullable = false)
|-- monetary: double (nullable = false)
And here is a data sample as well:
df.orderBy($"customerId").show
+----------+-------+---------+------------------+
|customerId|recency|frequency| monetary|
+----------+-------+---------+------------------+
|         1|    297|      114|            733.27|
|         2|    564|       11|            867.66|
|         3|   1304|        1|             35.89|
|         4|    287|       25|            153.08|
|         6|    290|       94|           316.772|
|         8|   1186|        3|            440.21|
|        11|    561|        5|            489.70|
|        14|    333|       57|            123.94|
+----------+-------+---------+------------------+
I'm trying to find, for each column, the interval on a quantile vector that contains each value, given a probability segment.
In other words, given a vector v of non-decreasing breakpoints (in my case, the quantile vector), find the interval containing each element of x; i.e. (pseudo-code),
if i <- findInterval(x, v),
for each index j in x:
v[i[j]] ≤ x[j] < v[i[j] + 1], where v[0] := -Inf, v[N+1] := +Inf, and N <- length(v).
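For reference, here is a minimal plain-Scala sketch of what I understand findInterval to do (my own helper, not an existing library function): the interval index is simply the number of breakpoints less than or equal to x.
// Returns 0 when x is below every breakpoint, v.length when x is at or above the last one.
def findInterval(x: Double, v: Array[Double]): Int =
  v.count(_ <= x)

// e.g. findInterval(297.0, Array(287.0, 295.25, 447.0, 719.5, 1304.0)) == 2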
In R, this translates to the following code:
probSegment <- c(0.0, 0.25, 0.50, 0.75, 1.0)
RFM_table$Rsegment <- findInterval(RFM_table$Recency, quantile(RFM_table$Recency, probSegment))
RFM_table$Fsegment <- findInterval(RFM_table$Frequency, quantile(RFM_table$Frequency, probSegment))
RFM_table$Msegment <- findInterval(RFM_table$Monetary, quantile(RFM_table$Monetary, probSegment))
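For context, this is the shape of the Scala port I'm aiming for, with the findInterval helper above wrapped in a UDF. The hard-coded recencyQuantiles vector is only an illustrative placeholder (the quartiles of the sample above); computing it from the DataFrame column is exactly the part I'm missing:
import org.apache.spark.sql.functions.udf

def findInterval(x: Double, v: Array[Double]): Int = v.count(_ <= x)

// Placeholder for the equivalent of quantile(RFM_table$Recency, probSegment).
val recencyQuantiles = Array(287.0, 295.25, 447.0, 719.5, 1304.0)

val rSegment = udf((recency: Long) => findInterval(recency.toDouble, recencyQuantiles))
val withRSegment = df.withColumn("rSegment", rSegment($"recency"))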
I'm kind of stuck with the quantile function, though.
In an earlier discussion with @zero323, he suggested that I use the percentRank window function as a shortcut. I'm not sure that I can apply the percentRank function in this case.
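For concreteness, this is the kind of usage I think he meant (a sketch, assuming Spark 1.6+ where functions.percent_rank is available; the column and segment names are mine):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{floor, least, lit, percent_rank}

// A window with no partitionBy pulls all rows into a single partition, which may not scale.
val byRecency = Window.orderBy($"recency")

// percent_rank gives each row's relative rank in [0, 1].
val ranked = df.withColumn("recencyPct", percent_rank().over(byRecency))

// Map the percentile to a quartile segment 1..4; the boundary behaviour may not
// match R's quantile() + findInterval exactly.
val segmented = ranked.withColumn("rSegment", least(floor($"recencyPct" * 4) + 1, lit(4)))

But I don't know whether this is actually equivalent to the quantile/findInterval approach above.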
How can I apply a quantile function to a DataFrame column with Scala Spark? If this is not possible, can I use the percentRank function instead?
Thanks.