I'm trying to understand the execution complexity of the iloc function in pandas.
I read the following Stack Exchange thread (Pandas DataFrame search is linear time or constant time?), which says:
"accessing single row by index (index is sorted and unique) should have runtime O(m) where m << n_rows"
mentioning that iloc runs in O(m) time. What is m (linear, log, constant, ...)?
Some experiments I ran:
>>> import pandas as pd
>>> a = pd.DataFrame([[1,3,4],[1,4,5],[2,2,3],[2,3,4]], columns=['a','b','c'])
>>> a = a.set_index('a').sort_index()
>>> a
   b  c
a
1  3  4
1  4  5
2  2  3
2  3  4
>>> a.iloc[[0,1,2,3]]
   b  c
a
1  3  4
1  4  5
2  2  3
2  3  4
So iloc clearly works with positional offsets and not with the integer-based index (column a). Even if we delete a few rows at the top, the iloc offset-based lookup still works correctly:
>>> a.drop([1]).iloc[[0,1]]
   b  c
a
2  2  3
2  3  4
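For contrast (this extra comparison is my own, run on the same small frame), label-based loc does consult the index, while iloc ignores it entirely:
>>> a.loc[1]        # label-based: every row whose index label is 1
   b  c
a
1  3  4
1  4  5
>>> a.iloc[0]       # position-based: the first physical row, whatever its label
b    3
c    4
Name: 1, dtype: int64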
So why doesn't the iloc offset lookup run in a time comparable to numpy arrays, when each column is simply a numpy array that can be accessed in constant time (a few operations)? And what is its complexity?
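To get a feel for where the time goes, here is a minimal timing sketch (my own; the array size and repetition count are arbitrary, and absolute numbers depend on machine and library versions) comparing a single scalar read through raw numpy indexing, iat, and iloc:
import timeit
import numpy as np
import pandas as pd

arr = np.ones((1000000, 2), dtype=np.uint32)
df = pd.DataFrame(arr)

# Time one scalar read, repeated 100000 times through each interface.
print(timeit.timeit(lambda: arr[500000, 1], number=100000))     # raw numpy indexing
print(timeit.timeit(lambda: df.iat[500000, 1], number=100000))  # scalar-optimized pandas accessor
print(timeit.timeit(lambda: df.iloc[500000, 1], number=100000)) # general positional indexer
If the per-call cost stays roughly the same when the array size changes, that would point to iloc's scalar lookup being O(1) with a large constant factor dominated by Python-level overhead rather than by the array access itself.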
UPDATE:
I tried to compare the efficiency of pandas vs numpy on a 10000000x2 matrix, comparing the time of a per-row value increment in a DataFrame df and in an array arr, with and without a for loop:
# Initialization
import numpy as np
import pandas as pd

SIZE = 10000000
arr = np.ones((SIZE,2), dtype=np.uint32)
df = pd.DataFrame(arr)

# numpy, no for-loop
arr[range(SIZE),1] += 1

# pandas, no for-loop
df.iloc[range(SIZE),1] += 1

# numpy, for-loop
for i in range(SIZE):
    arr[i,1] += 1

# pandas, for-loop
for i in range(SIZE):
    df.iloc[i,1] += 1
| Method | Execution time |
|---|---|
| numpy, no for-loop | 7 seconds |
| pandas, no for-loop | 24 seconds |
| numpy, with for-loop | 27 seconds |
| pandas, with for-loop | > 2 hours |
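For reference, here is a sketch (my own, not part of the measurement above) of how the same increment can be written without paying the per-call iloc overhead. Passing range(SIZE) also forces fancy indexing in both libraries, whereas a plain slice or a whole-column operation avoids materializing that index array:
import numpy as np
import pandas as pd

SIZE = 10000000
arr = np.ones((SIZE, 2), dtype=np.uint32)
df = pd.DataFrame(arr.copy())

arr[:, 1] += 1       # numpy slice: no index array is built
df[1] += 1           # pandas whole-column operation, one vectorized call
df.iloc[:, 1] += 1   # iloc with a slice instead of range(SIZE)
My working assumption is that the >2 hours in the per-row pandas case comes from paying iloc's fixed Python-level call overhead SIZE times, not from the element access itself.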