
I know that a sliding window in Spark Structured Streaming is a window on event time that has the window size (in seconds) and the step size (in seconds).

But then I came across this:

import org.apache.spark.mllib.rdd.RDDFunctions._

sc.parallelize(1 to 100, 10)
  .sliding(3)
  .map(curSlice => (curSlice.sum / curSlice.size))
  .collect()

I don't understand this. There is no event time here, so what does sliding do?

If I comment out the .map line then I get results like:

[I@7b3315a5
[I@8ed9cf
[I@f72203
[I@377008df
[I@540dbda9
[I@22bb5646
[I@1be59f28
[I@2ce45a7b
[I@153d4abb
...

What does it mean to use the sliding method of mllib like that on simple integers? And what are those gibberish values?

Alon

1 Answer


In the documentation for sliding we can see:

Returns an RDD from grouping items of its parent RDD in fixed size blocks by passing a sliding window over them. The ordering is first based on the partition index and then the ordering of items within each partition. [...]

So in the case of using sc.parallelize(1 to 100, 10) the order would be consecutive numbers from 1 to 100.
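
As a rough sketch of that ordering (assuming the same local SparkContext sc as in the question), the windows follow the overall order 1 to 100 and can span partition boundaries:

import org.apache.spark.mllib.rdd.RDDFunctions._

// 10 partitions, but the windows follow the overall order 1..100
// and may cross partition boundaries
val windows = sc.parallelize(1 to 100, 10)
  .sliding(3)
  .collect()

println(windows.length)               // 98 windows (100 - 3 + 1)
println(windows.head.mkString(","))   // 1,2,3
println(windows.last.mkString(","))   // 98,99,100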

The result of the sliding operation is an RDD of Arrays (one Array per window). Printing an Array calls its toString method; Array does not override toString, so the version defined in Object is used, which produces TypeName@hexadecimalHash. See How do I print my Java object without getting "SomeType@2f92e0f4"?.

You can use map(_.toSeq) to convert each array to a Seq, which overrides the toString method (and thus prints as a list, as expected). Or you can use map(_.mkString(",")) to convert each array to a string instead.
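
For example, a small sketch using the same RDD as in the question, printing readable windows instead of array hash codes:

sc.parallelize(1 to 100, 10)
  .sliding(3)
  .map(_.mkString(","))   // or .map(_.toSeq)
  .collect()
  .foreach(println)
// 1,2,3
// 2,3,4
// ...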

The result of using sliding(3) would be (in this fixed order):

1,2,3
2,3,4
3,4,5
...
98,99,100
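
With the windows in that order, the map in the question computes a moving average (using integer division) over each window; a quick sketch of the first few expected values:

val avgs = sc.parallelize(1 to 100, 10)
  .sliding(3)
  .map(curSlice => curSlice.sum / curSlice.size)  // integer division, as in the question
  .collect()

println(avgs.take(3).mkString(", "))  // 2, 3, 4  (averages of 1,2,3 / 2,3,4 / 3,4,5)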
Shaido
  • I understand the explanation about "SomeType@2f92e0f4", but not about the sliding window. I understand what a sliding window is in Structured Streaming: each time you process one window, and the window definition controls both the frequency of the job and how old the data you process is. So sliding here just groups the data? Groups it into what? And BTW, when I run this code it prints every number on a new line, not grouped. – Alon Nov 29 '19 at 07:16
  • @Alon: In non-streaming applications you will only have a single dataframe/RDD - there is no concept of frequency since there won't be any new data. Sliding here is equivalent to [sliding in scala](https://www.scala-lang.org/api/2.12.2/scala/collection/immutable/List.html#sliding(size:Int):Iterator[Repr]): it will group the data into equal-sized arrays. As long as you don't flatten the result after collecting it, you should have groups of data. Try using the string-converting method I mention above and see the result. (Note that I corrected the result part in my answer.) – Shaido Nov 30 '19 at 07:34
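
A minimal plain-Scala sketch (no Spark involved) of the equivalence mentioned in the last comment, showing that sliding groups a collection into equal-sized, overlapping windows:

// sliding on an ordinary Scala collection
(1 to 10).sliding(3).foreach(w => println(w.mkString(",")))
// 1,2,3
// 2,3,4
// ...
// 8,9,10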