I have a sparse dataset like this:
ip,ts,session
"123","1","s1"
"123","2",""
"123","3",""
"123","4",""
"123","10","s2"
"123","11",""
"123","12",""
"222","5","s6"
"222","6",""
"222","7",""
I need to make it dense, carrying each session value forward to the following rows for the same ip (ordered by ts), like this:
ip,ts,session
"123","1","s1"
"123","2","s1"
"123","3","s1"
"123","4","s1"
"123","10","s2"
"123","11","s2"
"123","12","s2"
"222","5","s6"
"222","6","s6"
"222","7","s6"
I know how to do this with an RDD: repartition by ip, then inside mapPartitions do groupBy(ip).sortBy(ts).scan()(). The scan function carries the previously computed value into the next iteration, decides whether to use that prior value or keep the current one, and passes the result on to the next scan iteration.
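To make the scan step concrete, here is a minimal sketch of the carry-forward logic in plain Python on in-memory rows. It is not Spark code. The names `rows` and `fill_sessions` are made up for illustration, and it assumes that an empty string means "no session" and that ts should be ordered numerically.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Sample rows from the sparse dataset: (ip, ts, session)
rows = [
    ("123", "1", "s1"), ("123", "2", ""), ("123", "3", ""), ("123", "4", ""),
    ("123", "10", "s2"), ("123", "11", ""), ("123", "12", ""),
    ("222", "5", "s6"), ("222", "6", ""), ("222", "7", ""),
]

def fill_sessions(rows):
    """Forward-fill empty session values within each ip, ordered by ts."""
    # ts is stored as a string, so convert it to int to sort numerically
    # (otherwise "10" would sort before "2").
    ordered = sorted(rows, key=lambda r: (r[0], int(r[1])))
    dense = []
    for ip, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # The "scan": keep the current session if it is non-empty,
        # otherwise reuse the value carried over from the previous row.
        sessions = accumulate((r[2] for r in group),
                              lambda prev, cur: cur or prev)
        dense.extend((ip, ts, s) for (_, ts, _), s in zip(group, sessions))
    return dense

for row in fill_sessions(rows):
    print(",".join(f'"{v}"' for v in row))
```

A row whose ip has no earlier non-empty session would stay empty in this sketch, since there is nothing to carry forward.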
Now I'm trying to do this with the DataFrame API only, without going back to RDDs. I looked at window functions, but all I could come up with is the first value within a group, which is not the same thing. Or maybe I just do not understand how to define the correct window range.