6

I'm trying to "translate" some of my R scripts to Python, but I notice that working with data frames in Python is tremendously slower than doing it in R, e.g. extracting cells according to some conditions.

I've done a little investigation, this is how much time it takes to look for a specific value in Python:

import pandas as pd
from timeit import default_timer as timer

code = 145896

# real df is way bigger
df = pd.DataFrame(data={
    'code1': [145896, 800175, 633974, 774521, 416109],
    'code2': [100, 800, 600, 700, 400],
    'code3': [1, 8, 6, 7, 4]}
    )

start = timer()
for _ in range(100000):
    desired = df.loc[df['code1']==code, 'code2'][0]
print(timer() - start) # 19.866242500000226 (sec)

and in R:

code <- 145896

df <- data.frame("code1" = c(145896, 800175, 633974, 774521, 416109),
           "code2" = c(100, 800, 600, 700, 400),
           "code3" = c(1, 8, 6, 7, 4))

start <- Sys.time()
for (i in 1:100000) {
  desired <- df[df$code1 == code, "code2"]
}
print(Sys.time() - start) # Time difference of 1.140949 secs

I'm relatively new to Python, and I'm probably missing something. Is there some way to speed up this process? Maybe the whole idea of porting this script to Python is pointless? In other operations Python is faster (namely working with strings), and it would be very inconvenient to jump between two or more scripts whenever working with data frames is required. Any help on this, please?

UPDATE: The real script block iterates over the rows of the initial data frame (which is fairly large, 500-1500k rows) and creates a new one whose rows contain the value from the original column "code1", the codes that correspond to it from another data frame, and many other newly created values. I believe I can clarify it with the picture: [diagram of the row-wise lookup between the two data frames]

Later in the script I will need to search for specific values in loops based on different conditions too. So the speed of search is essential.

leonefamily
  • Is this picture not just depicting a merge? – ifly6 Jul 22 '21 at 21:39
  • You could say so from this pic, @ifly6, but it's a simplified (and maybe not the best) graphic representation of what is done. The calculation is row-wise, and the next row often, but not always, depends on previous values and/or the code, so it's pretty unpredictable when it comes to a merge. I have no problems with the other values, but searching for the other codes is slow, and I want to improve this for current and further tasks – leonefamily Jul 22 '21 at 22:21
  • 1
    Well, the picture didn't make it any clearer, but from your comment it does seem that you need to do a `join`, and maybe apply some `rolling` stuff to the result. – Dimitry Jul 22 '21 at 22:45
  • Looping over Pandas or NumPy is kinda slow. Both are intended to be used with vectorized operations. – grofte Oct 18 '22 at 11:52

4 Answers

13

Since you are looking to select a single value from a DataFrame there are a few things you can do to improve performance.

  1. Use .item() instead of [0], which gives a small but decent improvement, especially for smaller DataFrames.
  2. It's wasteful to mask the entire DataFrame just to then select a known Series. Instead mask only the Series and select the value. Though you might think "oh this is chained -- the forbidden ][", it's only chained assignment which is worrisome, not chained selection.
  3. Use numpy. Pandas has a lot of overhead due to the indexing and alignment. But you just want to select a single value from a rectangular data structure, so dropping down to numpy will be faster (a combined sketch follows this list).
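Applied to the sample DataFrame from the question, the three tips above could look like this (a minimal sketch; every variant returns the same scalar):

import pandas as pd

code = 145896
df = pd.DataFrame({'code1': [145896, 800175, 633974, 774521, 416109],
                   'code2': [100, 800, 600, 700, 400],
                   'code3': [1, 8, 6, 7, 4]})

# 1. use .item() instead of [0]
desired = df.loc[df['code1'] == code, 'code2'].item()

# 2. mask only the Series, then pull the value
desired = df['code2'][df['code1'] == code].item()

# 3. drop down to numpy arrays before masking
desired = df['code2'].to_numpy()[df['code1'].to_numpy() == code].item()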

Below are illustrations of the timing for the different ways to select the data (each with its own method below). Using numpy is by far the fastest, especially for a smaller DataFrame like the one in your sample. For those, it will be more than 20x faster than your original way to select data, which, looking at your initial comparison with R, should make it slightly faster than selecting data in R. As the DataFrames get larger the relative performance of the numpy solution isn't as good, but it's still the fastest method (shown here).

import perfplot
import pandas as pd
import numpy as np

def DataFrame_Slice(df, code=0):
    return df.loc[df['code1'] == code, 'code2'].iloc[0]

def DataFrame_Slice_Item(df, code=0):
    return df.loc[df['code1'] == code, 'code2'].item()

def Series_Slice_Item(df, code=0):
    return df['code2'][df['code1'] == code].item()

def with_numpy(df, code=0):
    return df['code2'].to_numpy()[df['code1'].to_numpy() == code].item()


perfplot.show(
    setup=lambda N: pd.DataFrame({'code1': range(N),
                                  'code2': range(50, N+50),
                                  'code3': range(100, N+100)}),
    kernels=[
        lambda df: DataFrame_Slice(df),
        lambda df: DataFrame_Slice_Item(df),
        lambda df: Series_Slice_Item(df),
        lambda df: with_numpy(df)
    ],
    labels=['DataFrame_Slice', 'DataFrame_Slice_Item', 'Series_Slice_Item', 'with_numpy'],
    n_range=[2 ** k for k in range(1, 21)],
    equality_check=np.allclose,  
    relative_to=3,
    xlabel='len(df)'
)

[perfplot timing comparison of the four selection methods]

ALollz
  • Using N=20 and creating the DataFrame, `%timeit %timeit with_numpy(df)` gives `1.04 ms ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)` which is quite a bit. – Tarik Mar 22 '22 at 06:21
  • `df.loc[df['code1'] == code, 'code2'][0]` fails for values other than 0. The following should be used instead: `df.loc[df['code1'] == code, 'code2'].iloc[0]` – Tarik Mar 22 '22 at 08:12
  • @Tarik yeah. Was the OPs code so I didn't pay much attention to it. – ALollz Mar 22 '22 at 13:28
  • with `numpy` the performance is better. – BetterCallMe Sep 19 '22 at 12:04
6

You can cut it in about half just by reusing the filter expression.

In [1]: import pandas as pd

In [2]: code = 145896
   ...: df = pd.DataFrame(data={
   ...:     'code1': [145896, 800175, 633974, 774521, 416109],
   ...:     'code2': [100, 800, 600, 700, 400],
   ...:     'code3': [1, 8, 6, 7, 4]
   ...: })

In [3]: %timeit df.loc[df['code1'] == code, 'code2'][0]
197 µs ± 5.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: filter_expr = df['code1'] == code

In [5]: %timeit df.loc[filter_expr, 'code2'][0]
106 µs ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Indexing the dataframe by the key column (assuming the lookup is frequent) should be the way to go because the dataframe's index is a hash table (see this answer and these slides for more details).

In [6]: df_idx = df.set_index('code1')

In [7]: %timeit df_idx.loc[code]['code2']
72.7 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And depending on the other use cases that you have, keeping the data in a true embedded (in-memory) database, such as SQLite or DuckDB (which can run queries directly on Pandas data without ever importing or copying anything), may also be a solution.
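For example, a minimal DuckDB sketch (assuming the duckdb package is installed; DuckDB can reference a pandas DataFrame named df in the local scope without copying it):

import duckdb
import pandas as pd

df = pd.DataFrame({'code1': [145896, 800175, 633974, 774521, 416109],
                   'code2': [100, 800, 600, 700, 400]})

# the query scans the in-memory pandas DataFrame directly
desired = duckdb.query("SELECT code2 FROM df WHERE code1 = 145896").fetchone()[0]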

saaj
  • This doesn't make sense to me (why is that so?), but if no faster solution is offered, I'm using this tip. Sadly, it's still almost 10 times slower than in R... – leonefamily Jul 22 '21 at 20:12
  • @ALollz is right, I just used this example to show the difference in computing time. In the real script the "code" variable is different every time. The data frames I refer to have from 500k to 1.5 million rows, which makes completing this task awfully slow – leonefamily Jul 22 '21 at 20:21
  • @BigBen it was alternative to `timeit.timeit(...,number = 100000)` obviously. – Dimitry Jul 22 '21 at 20:21
  • Yup Index selection is definitely the way to go for repeated queries! – ALollz Jul 22 '21 at 21:01
5

One way to do it is to fall back on numpy by first getting the underlying arrays and then looping:

import pandas as pd
from timeit import default_timer as timer

code = 145896

# real df is way bigger
df = pd.DataFrame(data={
    'code1': [145896, 800175, 633974, 774521, 416109],
    'code2': [100, 800, 600, 700, 400],
    'code3': [1, 8, 6, 7, 4]}
    )

start = timer()


code1 = df['code1'].values
code2 = df['code2'].values
code3 = df['code3'].values

for _ in range(100000):
    desired = code1 == code
    desired_code2 = code2[desired][0]
    desired_code3 = code3[desired][0]

print(timer() - start) # 0.26 (sec)

The following code adapted from ALollz's answer illustrates the difference in performance of the different techniques with several more added and an increased dataset size range that shows changing behavior affected by orders of magnitude.

import perfplot
import pandas as pd
import numpy as np

def DataFrame_Slice(df, code=0):
    return df.loc[df['code1'] == code, 'code2'].iloc[0]

def DataFrame_Slice_Item(df, code=0):
    return df.loc[df['code1'] == code, 'code2'].item()

def Series_Slice_Item(df, code=0):
    return df['code2'][df['code1'] == code].item()

def with_numpy(df, code=0):
    return df['code2'].to_numpy()[df['code1'].to_numpy() == code].item()

def with_numpy_values(code1, code2, code=0):
    desired = code1 == code
    desired_code2 = code2[desired][0]
    return desired_code2

def DataFrameIndex(df, code=0):
    return df.loc[code].code2

def with_numpy_argmax(code1, code2, code=0):
    return code2[np.argmax(code1==code)]

def numpy_search_sorted(code1, code2, sorter, code=0):
    return code2[sorter[np.searchsorted(code1, code, sorter=sorter)]]

def python_dict(code1_dict, code2, code=0):
    return code2[code1_dict[code]]

def shuffled_array(N):
    a = np.arange(0, N)
    np.random.shuffle(a)
    return a

def setup(N):
    print(f'setup {N}')
    df = pd.DataFrame({'code1': shuffled_array(N),
                                  'code2': shuffled_array(N),
                                  'code3': shuffled_array(N)})
    
    code = df.iloc[min(len(df)//2, len(df)//2 + 20)].code1
    
    sorted_index = df['code1'].values.argsort()
    
    code1 = df['code1'].values
    code2 = df['code2'].values
    
    code1_dict = {code1[i]: i for i in range(len(code1))}
    
    return df, df.set_index('code1'), code1, code2, sorted_index, code1_dict, code


for relative_to in [5, 6, 7]:
    perfplot.show(
        setup=setup,
        kernels=[
            lambda df, _, __, ___, sorted_index, ____, code: DataFrame_Slice(df, code),
            lambda df, _, __, ___, sorted_index, ____, code: DataFrame_Slice_Item(df, code),
            lambda df, _, __, ___, sorted_index, ____, code: Series_Slice_Item(df, code),
            lambda df, _, __, ___, sorted_index, ____, code: with_numpy(df, code),
            lambda _, __, code1, code2, sorted_index, ____,  code: with_numpy_values(code1, code2, code),
            lambda _, __, code1, code2, sorted_index, ____, code: with_numpy_argmax(code1, code2, code),
            lambda _, df, __, ___, sorted_index, ____, code: DataFrameIndex(df, code),
            lambda _, __, code1, code2, search_index, ____, code: numpy_search_sorted(code1, code2, search_index, code),
            lambda _, __, ___, code2, _____, code1_dict, code: python_dict(code1_dict, code2, code)
        ],
        logy=True,
        labels=['DataFrame_Slice', 'DataFrame_Slice_Item', 'Series_Slice_Item', 'with_numpy', 'with_numpy_values', 'with_numpy_argmax', 'DataFrameIndex', 'numpy_search_sorted', 'python_dict'],
        n_range=[2 ** k for k in range(1, 25)],
        equality_check=np.allclose,  
        relative_to=relative_to,
        xlabel='len(df)')

Conclusion: numpy searchsorted is the best-performing technique, except for very small data sets below the 100 mark. Sequential search over the underlying numpy arrays is the next best option for datasets roughly below 100,000, after which the best option is to use the DataFrame index. However, when hitting the multi-million record mark, DataFrame indexes no longer perform well, probably due to hashing collisions.

[EDIT 03/24/2022] Using a Python dictionary beats all other techniques by at least one order of magnitude.
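For reference, outside of the benchmark harness the dictionary technique boils down to building the lookup once and then indexing into the underlying array (a sketch that assumes the values in code1 are unique):

code1 = df['code1'].to_numpy()
code2 = df['code2'].to_numpy()

# build once, reuse for every subsequent search
code1_dict = {c: i for i, c in enumerate(code1)}

desired = code2[code1_dict[145896]]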

[Figures 1-3: perfplot timings, each plotted relative to a different baseline]

Note: We are assuming repeated searches within the DataFrame, so as to offset the cost of acquiring the underlying numpy arrays, indexing the DataFrame or sorting the numpy array.

Tarik
0

R was probably designed around this kind of operation. Pandas is a third-party library in Python, so it has additional limitations and overheads to work around. Namely, each intermediate step generates a new dataframe or series, and almost every equals sign or bracket is an intermediate step.

If you really want to extract a single element at a time from a single column, you can try setting an index:

df2 = df.set_index('code1')

def getel(df2, code):
    desired = None
    if code in df2.index:
        desired = df2['code2'][code]
        # a duplicated key returns a Series, so take the first match
        if isinstance(desired, pd.Series):
            desired = desired.iloc[0]
    return desired

This is three times faster than the original if the value is duplicated. If the value is unique, then `desired = df2['code2'][code]` does not generate a new series and the code is 17 times faster than the original. Also, notice that selecting the column before everything else tends to noticeably reduce the unnecessary work pandas does.

If you instead want to do similar operations on different values, then you should probably look at groupby. Or at least filter all the values to be processed at once:

codes = {145896, 774521}
pad = df['code1'].apply(lambda x: x in codes)  # this is good when there are many codes
# or
pad = df['code1'].isin(codes)  # this is linear time in len(codes)
result = df[pad].apply(do_stuff, axis=1)
# or: df[pad].transform(do_stuff, axis=1)
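A possible sketch of the groupby route mentioned above (process_group is a placeholder for the real per-code computation, just like do_stuff):

codes = {145896, 774521}

def process_group(group):
    # placeholder: whatever needs to be computed for one code's rows
    return group['code2'].iloc[0]

# filter the relevant codes once, then handle each code's rows together
result = df[df['code1'].isin(codes)].groupby('code1').apply(process_group)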
Dimitry