Coming from the beautiful world of C, I am trying to understand this behavior:
In [1]: dataset = sqlContext.read.parquet('indir')
In [2]: sizes = dataset.mapPartitions(lambda x: [len(list(x))]).collect()
In [3]: count = 0
   ...: for item in sizes:
   ...:     if item == min(sizes):
   ...:         count = count + 1
   ...:
This would not finish even after 20 minutes, and I know that the list sizes is not that big: less than 205k elements. However, this executed instantly:
In [8]: min_item = min(sizes)
In [9]: count = 0
   ...: for item in sizes:
   ...:     if item == min_item:
   ...:         count = count + 1
   ...:
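The effect reproduces outside Spark as well. Here is a minimal timing sketch on a synthetic list (my own stand-in for sizes, far shorter than the real data) comparing the two loops:

```python
import time

# Synthetic stand-in for `sizes` -- the real list comes from Spark and has
# ~205k elements; 5000 is enough to see the difference without a long wait.
sizes = list(range(5000))

# Version 1: min(sizes) evaluated inside the loop condition.
start = time.perf_counter()
count = 0
for item in sizes:
    if item == min(sizes):
        count = count + 1
slow = time.perf_counter() - start

# Version 2: min(sizes) hoisted out of the loop.
start = time.perf_counter()
min_item = min(sizes)
count = 0
for item in sizes:
    if item == min_item:
        count = count + 1
fast = time.perf_counter() - start

print(f"in-loop min(): {slow:.4f}s, hoisted min(): {fast:.6f}s")
```

The first version is dramatically slower, and the gap keeps widening as the list grows.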
So what happened?
My guess: since Python is interpreted, it cannot tell that min(sizes) is loop-invariant, and so it cannot replace the call with its return value after the first few iterations. The reference for min() doesn't say anything that would explain the matter to me. I also thought min() might need to look at the partitions to compute its result, but that shouldn't be the case, since sizes is a list, not an RDD!
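To probe that guess, I wrapped the built-in min() in a counting function on a tiny synthetic list (my own stand-in, not the Spark data):

```python
calls = 0

def counted_min(seq):
    """Delegate to the built-in min(), but count how often it is invoked."""
    global calls
    calls += 1
    return min(seq)

sizes = [5, 3, 9, 3, 7]  # tiny stand-in for the real list

count = 0
for item in sizes:
    if item == counted_min(sizes):
        count = count + 1

print(calls, count)  # min() ran once per iteration: 5 calls, and count == 2
```

So the condition really is re-evaluated on every iteration: nothing caches the result of min(sizes) for me, which makes the original loop scan the whole list once per element.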
Edit:
Here is the source of my confusion; I wrote a similar program in C:
for(i = 0; i < SIZE; ++i)
    if(array[i] == mymin(array, SIZE))
        ++count;
and got these timings:
C02QT2UBFVH6-lm:~ gsamaras$ gcc -Wall main.c
C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
That took 98.679177000 seconds wall clock time.
C02QT2UBFVH6-lm:~ gsamaras$ gcc -O3 -Wall main.c
C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
That took 0.000000000 seconds wall clock time.
For the timings, I used Nominal Animal's approach from my question on time measurements.