I am reading many images, and I would like to work on a tiny subset of them for development. As a result, I am trying to understand how Spark and Python could make that happen:
In [1]: d = sqlContext.read.parquet('foo')
In [2]: d.map(lambda x: x.photo_id).first()
Out[2]: u'28605'
In [3]: d.limit(1).map(lambda x: x.photo_id)
Out[3]: PythonRDD[31] at RDD at PythonRDD.scala:43
In [4]: d.limit(1).map(lambda x: x.photo_id).first()
# still running...
..so what is happening? I would expect `limit()` to run much faster than what we had in [2], but that's not the case*.
Below I will describe my understanding, and please correct me, since obviously I am missing something:
- `d` is an RDD of pairs (I know that from the schema), and with the map function I am saying: i) take every pair (which will be named `x`) and give me back the `photo_id` attribute; ii) that will result in a new (anonymous) RDD, on which we apply the `first()` method, which I am not sure how it works$ (see the sketch after this list), but which should give me the first element of that anonymous RDD.
- In `[3]`, we limit the `d` RDD to 1, which means that although `d` has many elements, only 1 is used, and the map function is applied to that one element only. `Out [3]` should be the RDD created by the mapping.
- In `[4]`, I would expect to follow the logic of `[3]` and just print the one and only element of the limited RDD...
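As far as I can tell from reading the PySpark source (the docs did not make this obvious to me), `first()` is just a thin wrapper around `take(1)`; paraphrased:

# Paraphrase of RDD.first() from pyspark/rdd.py: it delegates to take(1),
# which launches a job that scans only as many partitions as it needs,
# and raises if the RDD turns out to be empty.
def first(self):
    rs = self.take(1)
    if rs:
        return rs[0]
    raise ValueError("RDD is empty")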
Sure enough, looking at the monitor, [4] seems to process the whole dataset while the others don't, so it seems that I am not using `limit()` correctly, or that it is not what I am looking for:
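In case it helps, here is what I can inspect on my side: `explain()` on the DataFrame side and `toDebugString()` on the RDD side (same `d` as above):

# Physical plan of the limited DataFrame, printed to stdout:
d.limit(1).explain()

# Lineage of the RDD produced by the map, i.e. the chain of
# transformations that first() would have to execute:
print(d.limit(1).map(lambda x: x.photo_id).toDebugString())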
Edit:
tiny_d = d.limit(1).map(lambda x: x.photo_id)
tiny_d.first()
The first line gives a `PipelinedRDD`, which, as described here, does not actually perform any action, just a transformation.
However, the second line will also process the whole dataset (as a matter of fact, the number of tasks is now as many as before, plus one!).
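For completeness, the one approach that stays fast for me is to skip `limit()` and build the development subset with `take()`, which (like [2]) scans partitions incrementally; a sketch, assuming the usual SparkContext `sc`:

# take(n) runs an incremental job: it reads the first partition and only
# touches more partitions if it still needs rows, so it behaves like In [2].
subset = d.map(lambda x: x.photo_id).take(10)

# parallelize() turns the small local list back into an RDD to develop against.
tiny_rdd = sc.parallelize(subset)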
*[2] executed instantly, while [4] is still running after more than 3 hours.
$I couldn't find `first()` in the documentation; the name is too generic to search for.