10

I have a DataFrame df with 541 columns, and I need to save all unique pairs of its column names into the rows of a separate DataFrame, repeated 8 times each.

I thought I would create an empty DataFrame fp, double loop through df's column names, insert into every 8th row, and fill in the blanks with the last available value.
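Roughly, the plan looks like this (a sketch; I'm assuming "unique pairs" includes a column paired with itself, since combinations-with-replacement of 541 names gives exactly the 146,611 writes mentioned below; the column names here are stand-ins):

import itertools
import numpy as np
import pandas as pd

cols = ["c%d" % i for i in range(541)]                  # stand-in names
pairs = list(itertools.combinations_with_replacement(cols, 2))
len(pairs)                                              # 146611

fp = pd.DataFrame(index=range(len(pairs) * 8), columns=["a", "b"])
for i, (a, b) in enumerate(pairs):                      # write every 8th row
    fp.iloc[i * 8, 0] = a
    fp.iloc[i * 8, 1] = b
fp = fp.ffill()                                         # fill in the blanks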

When I tried to do this, though, I was baffled by how long it took. With 541 columns I only have to write 146,611 times, yet it's taking well over 20 minutes. This seems egregious for simple data access. Where is the problem and how can I solve it? Pandas takes less time than that to produce a correlation matrix from these columns, so I must be doing something wrong.

Here's a reproducible example of what I mean:

import numpy as np
import pandas as pd

fp = np.empty(shape=(146611, 10))
fp.fill(np.nan)

fp = pd.DataFrame(fp)

%timeit for idx in range(0, len(fp)): fp.iloc[idx, 0] = idx

# 1 loop, best of 3: 22.3 s per loop
badmax
  • 645
  • 2
  • 5
  • 12

2 Answers

21

Don't use iloc/loc/chained indexing for per-element writes. Dropping to the NumPy interface alone increases speed by ~180x; removing element access entirely bumps this to ~180,000x.

import numpy as np
import pandas as pd

fp = np.empty(shape=(146611, 10))
fp.fill(np.nan)

fp = pd.DataFrame(fp)

# this confirms how slow data access is on my computer
%timeit for idx in range(0, len(fp)): fp.iloc[idx, 0] = idx

1 loops, best of 3: 3min 9s per loop

# .values exposes the underlying NumPy array (a view, for a single-dtype
# frame like this one), so you can set the data directly
%timeit for idx in range(0, len(fp)): fp.values[idx, 0] = idx

1 loops, best of 3: 1.19 s per loop

This is because there's extensive code in the Python layer for this fancy indexing, adding ~10 µs of overhead per access. Pandas indexing should be used to retrieve entire subsets of data, which you then use to perform vectorized operations on the whole dataframe. Individual element access is glacial: even plain Python dictionaries will give you a > 180-fold increase in performance.
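To illustrate the dictionary comparison, here's a minimal sketch (same fp as above; illustrative rather than a rigorous benchmark):

# per-element writes into a plain dict skip the Pandas indexing machinery,
# leaving only cheap hash inserts
d = {}
for idx in range(len(fp)):
    d[idx] = idx

# the collected values can still be attached as a column in one shot
fp[0] = pd.Series(d)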

Things get a lot better when you access entire columns or rows instead of individual elements: about 3 orders of magnitude better.

# set all items in 1 go.
%timeit fp[0] = np.arange(146611)
1000 loops, best of 3: 814 µs per loop

Moral

Don't access individual elements via chained indexing, loc, or iloc. Generate a NumPy array in a single allocation, from a Python list (or via a C interface if performance is absolutely critical), and then perform operations on entire columns or dataframes.

Using NumPy arrays and performing operations directly on columns rather than individual elements, we got a whopping 180,000+ fold increase in performance. Not too shabby.
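Applied to the original question, the whole construction vectorizes. A sketch, assuming (as the counts in the question suggest) that the 146,611 pairs are combinations-with-replacement of the 541 column names, and that each pair repeats 8 times:

import itertools
import numpy as np
import pandas as pd

cols = ["c%d" % i for i in range(541)]                  # stand-in names
pairs = np.array(list(itertools.combinations_with_replacement(cols, 2)))

# np.repeat duplicates each row 8 times in a single C-level call
fp = pd.DataFrame(np.repeat(pairs, 8, axis=0), columns=["a", "b"])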

Edit

Comments from @kushy suggest Pandas may have optimized indexing in certain cases since I originally wrote this answer. Always profile your own code, and your mileage may vary.
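For example, a quick way to check on your own setup (bench is a hypothetical helper; only the standard-library timeit module is assumed):

import timeit
import numpy as np
import pandas as pd

fp = pd.DataFrame(np.full((146611, 10), np.nan))

def bench(stmt):
    # time 1,000 single-element writes; report microseconds per write
    t = timeit.timeit(stmt, globals=globals(), number=1000)
    print("%-28s %8.2f us/write" % (stmt, t / 1000 * 1e6))

bench("fp.iloc[0, 0] = 1")
bench("fp.iat[0, 0] = 1")
bench("fp.values[0, 0] = 1")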

Alex Huszagh
  • 13,272
  • 3
  • 39
  • 67
  • 4
    As of Aug 2018, I have to disagree and say that iloc for setting values is much faster than the access via values. Had an expression in a for loop which was basically `df[x][y] += 1`. Did it once with direct access, i.e. `df[specifier_name][y]+=1`, once using values, i.e. `df.values[y, x]+=1` and once with iloc, i.e. `df.iloc[y, x]+=1` and speedwise, it was iloc > direct access >>> values. I don't know why. – kushy Aug 01 '18 at 14:43
  • 2
    Edit: Depends on other factors too, apparently. Made a small script for testing and there, to my surprise, iloc was slower than values. If someone's interested, I could provide an outline of my initial problem. – kushy Aug 01 '18 at 14:54
  • 2
    @kushy This may have been optimized and no longer be relevant in all cases, which would be excellent. I'll add a disclaimer with a link to your comment in my answer. – Alex Huszagh Aug 02 '18 at 16:57
4

Alexander's answer was the fastest for me as of 2020-01-06 when using .to_numpy() instead of .values. Tested in a Jupyter Notebook on Windows 10. Pandas version = 0.24.2.

import numpy as np 
import pandas as pd
fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)
fp = pd.DataFrame(fp)
pd.__version__ # '0.24.2'

def func1():
    # Asker badmax solution
    for idx in range(0, len(fp)): 
        fp.iloc[idx, 0] = idx

def func2():
    # Alexander Huszagh solution 1
    # (to_numpy() on this single-dtype frame typically returns a view,
    # so the element write sticks)
    for idx in range(0, len(fp)):
        fp.to_numpy()[idx, 0] = idx

def func3():
    # user4322543 answer to
    # https://stackoverflow.com/questions/34855859/is-there-a-way-in-pandas-to-use-previous-row-value-in-dataframe-apply-when-previ
    new = []
    for idx in range(0, len(fp)):
        new.append(idx)
    fp[0] = new

def func4():
    # Alexander Huszagh solution 2
    fp[0] = np.arange(146611)

# note: without parentheses, %timeit measures looking up the function object,
# not calling it; use e.g. %timeit func1() to time the actual work
%timeit func1
19.7 ns ± 1.08 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func2
19.1 ns ± 0.465 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func3
21.1 ns ± 3.26 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func4
24.7 ns ± 0.889 ns per loop (mean ± std. dev. of 7 runs, 50000000 loops each)
Viragos
  • 561
  • 6
  • 15
  • 1
    Since this is relatively recent `.values` should be replaced with `.to_numpy()`, as suggested [in the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html). – AMC Jan 04 '20 at 22:11
  • Thanks @AMC, changed the results of the test such that function 2 was now the fastest instead of function 1 – Viragos Jan 07 '20 at 18:55
  • 1
    I just ran some tests myself on a (50000,15) dataset. For me, both iat and iloc run drastically faster as of version 1.4.1 – Jeroen Vermunt Feb 25 '22 at 20:31