6

Coming from the beautiful world of , I am trying understand this behavior:

In [1]: dataset = sqlContext.read.parquet('indir')
In [2]: sizes = dataset.mapPartitions(lambda x: [len(list(x))]).collect()
In [3]: for item in sizes:
   ...:     if(item == min(sizes)):
   ...:         count = count + 1
   ...:         

would not even finish after 20 minutes, and I know that the list sizes is not that big, less than 205k in length. However this executed instantly:

In [8]: min_item = min(sizes)

In [9]: for item in sizes:
    if(item == min_item):
        count = count + 1
   ...:         

So what happened?

My guess: could not understand that min(sizes) will be always constant, thus replace after the first few calls with its return value..since Python uses the interpreter..


Ref of min() doesn't say anything that would explain the matter to me, but what I was thinking is that it may be that it needs to look at the partitions to do that, but that shouldn't be the case, since sizes is a list, not an RDD!


Edit:

Here is the source of my confusion, I wrote a similar program in C:

for(i = 0; i < SIZE; ++i)
    if(i == mymin(array, SIZE))
        ++count;

and got these timings:

C02QT2UBFVH6-lm:~ gsamaras$ gcc -Wall main.c
C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
That took 98.679177000 seconds wall clock time.
C02QT2UBFVH6-lm:~ gsamaras$ gcc -O3 -Wall main.c
C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
That took 0.000000000 seconds wall clock time.

and for timings, I used Nomimal Animal's approach from my Time measurements.

gsamaras
  • 71,951
  • 46
  • 188
  • 305
  • 1
    The first code is `O(n*n)`, the second code is `O(n)`. How does this support the hypothesis? – user2864740 Aug 05 '16 at 21:55
  • 1
    CPython only does really simplistic optimizations. The dynamic nature of the language also makes many optimizations impossible: for example, imagine if some other code did `min = lambda x: 1`. – Colonel Thirty Two Aug 05 '16 at 21:56
  • 3
    There is no non-pure language I know of that would even try to "understand" this optimization. For it to even be valid would require a guarantee of deterministic behavior. – user2864740 Aug 05 '16 at 21:56
  • 4
    I wouldn't even trust a C compiler to perform this kind of optimization, let alone Python. There are too many weird possibilities the optimizer has to account for. – user2357112 Aug 05 '16 at 21:57
  • @user2864740 to you say it because of `[1]` and [`2`]? – gsamaras Aug 05 '16 at 21:58
  • `min` is expressed as function containing a loop 0..n. In the first example `min` is itself contained in a loop 0..n. – user2864740 Aug 05 '16 at 21:59
  • Yes I agree @user2864740, that's what I am saying..If Python doesn't do anything about it, it will be `O(n²)`, but that's what I am asking.. :) – gsamaras Aug 05 '16 at 22:01
  • "..[an algorithm of O(n^2)] would not even finish after 20 minutes.. [but one of O(n)] executed instantly.." Seems like a legit observation for some values of 'n' and 'instantly'. – user2864740 Aug 05 '16 at 22:01
  • Correct, so Python executes `min()` at every loop, that's what I was asking! Thanks @user2864740, are you going to answer or should I delete the question? – gsamaras Aug 05 '16 at 22:04
  • It is also possible to self-answer a question. If this question (and answer with relevant information/clarification/reasoning) would make an interesting find, then keep it. – user2864740 Aug 05 '16 at 22:05
  • @user2864740 I am pretty sure a guy like me wound not provide such a nice answer as you, for various reasons on this occasion...I updated my question with the source of confusion, which might be something to keep it alive. Will wait for someone to answer and if not, I will **try** to answer. – gsamaras Aug 05 '16 at 22:37

1 Answers1

5

I'm by no means an expert on the inner workings of python, but from my understanding thus far you'd like to compare the speed of

for item in sizes:
    if(item == min(sizes)):
        count = count + 1

and

min_item = min(sizes)
for item in sizes:
    if(item == min_item):
        count = count + 1

Now someone correct me if I have any of this wrong but,

In python lists are mutable and do not have a fixed length, and are treated as such, while in C an array has a fixed size. From this question:

Python lists are very flexible and can hold completely heterogeneous, arbitrary data, and they can be appended to very efficiently, in amortized constant time. If you need to shrink and grow your array time-efficiently and without hassle, they are the way to go. But they use a lot more space than C arrays.

Now take this example

for item in sizes:
    if(item == min(sizes)):
        new_item = item - 1
        sizes.append(new_item)

Then the value of item == min(sizes) would be different on the next iteration. Python doesn't cache the resulting value of min(sizes) since it would break the above example, or require some logic to check if the list has been changed. Instead it leaves that up to you. By defining min_item = min(sizes) you are essentially caching the result yourself.

Now since the array is a fixed size in C, it can find the min value with less overhead than a python list, thus why I think it has no problem in C (as well as C being a much lower level language).

Again, I don't fully understand the underlying code and compilation for python, and I'm certain if you analyzed the process of the loops in python, you'd see python repeatedly computing min(sizes), causing the extreme amount of lag. I'd love to learn more about the inner workings of python (for example, are any methods cached in a loop for python, or is everything computed again for each iteration?) so if anyone has more info and/or corrections, let me know!

Community
  • 1
  • 1
Xander Luciano
  • 3,753
  • 7
  • 32
  • 53
  • While you have a point and I accepted your answer, be warned that I don't think it's 100% clear. For example I just did the same think with an `std::vector` and got 115.9 sec. and 8.4 sec, which shows a dramatic speedup, despite of the flexibility of the vector. So, I would say that it's rather a [tag:python] thing, rather than a matter the data-structure's flexibility.. – gsamaras Aug 06 '16 at 01:20