dataframe.str[start:stop] where start and stop are columns in same data frame

Question

I would like to use pandas.str to vectorize slice operation on pandas column which values are list and start and stop values are int values in start and stop columns of same dataframe example:

df['column_with_list_values'].str[start:stop]

df[['list_values', 'start', 'stop']]
                list_values  start  stop
0              [5, 7, 6, 8]           0               2
1  [1, 3, 5, 7, 2, 4, 6, 8]           1               3
2  [1, 3, 5, 7, 2, 4, 6, 8]           0               2
3  [1, 3, 5, 7, 2, 4, 6, 8]           0               2
4  [1, 3, 5, 7, 2, 4, 6, 8]           1               3
5  [1, 3, 5, 7, 2, 4, 6, 8]           2               4
6  [1, 3, 5, 7, 2, 4, 6, 8]           0               2

and result would be
    0    [5, 7]
    1    [3, 5]
    2    [1, 3]
    3    [1, 3]
    4    [3, 5]
    5    [5, 7]
    6    [1, 3]

Thanks!

`{'list_values': {0: [5, 7, 6, 8], 1: [1, 3, 5, 7, 2, 4, 6, 8], 2: [1, 3, 5, 7, 2, 4, 6, 8], 3: [1, 3, 5, 7, 2, 4, 6, 8], 4: [1, 3, 5, 7, 2, 4, 6, 8], 5: [1, 3, 5, 7, 2, 4, 6, 8], 6: [1, 3, 5, 7, 2, 4, 6, 8]}, 'start': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0}, 'stop': {0: 2, 1: 3, 2: 2, 3: 2, 4: 3, 5: 4, 6: 2}}` — BeRT2me, Jun 18 '22 at 21:29

BeRT2me · Answer 1 · 2022-06-19T01:40:05.820

df.apply(lambda x: x.list_values[x.start:x.stop], axis=1)

Output:

0    [5, 7]
1    [3, 5]
2    [1, 3]
3    [1, 3]
4    [3, 5]
5    [5, 7]
6    [1, 3]
dtype: object

I'm not sure why, but the fastest variation appears to be:

df['sliced'] = [lst[start:stop] for lst, start, stop in zip(df.list_values.tolist(), df.start.tolist(), df.stop.tolist())]

My testing:

df = pd.DataFrame({'list_values': {0: [5, 7, 6, 8], 1: [1, 3, 5, 7, 2, 4, 6, 8], 2: [1, 3, 5, 7, 2, 4, 6, 8], 3: [1, 3, 5, 7, 2, 4, 6, 8], 4: [1, 3, 5, 7, 2, 4, 6, 8], 5: [1, 3, 5, 7, 2, 4, 6, 8], 6: [1, 3, 5, 7, 2, 4, 6, 8]}, 'start': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0}, 'stop': {0: 2, 1: 3, 2: 2, 3: 2, 4: 3, 5: 4, 6: 2}})
df = pd.concat([df]*100000)
# Shape is now (700000, 3)

def v1(df):
    temp = df.copy()
    temp['sliced'] = [lst[start:stop] for lst, start, stop in temp.values.tolist()]

def v2(df):
    temp = df.copy()
    temp['sliced'] = [lst[start:stop] for lst, start, stop in zip(temp.list_values, temp.start, temp.stop)]

def v3(df):
    temp = df.copy()
    temp['sliced'] = [lst[start:stop] for lst, start, stop in temp.values]

def v4(df):
    temp = df.copy()
    temp['sliced'] = [lst[start:stop] for lst, start, stop in zip(df.list_values.tolist(), df.start.tolist(), df.stop.tolist())]

def v5(df):
    temp = df.copy()
    temp['sliced'] = temp.apply(lambda x: x.list_values[x.start:x.stop], axis=1)

%timeit -n 10 v1(df)
%timeit -n 10 v2(df)
%timeit -n 10 v3(df)
%timeit -n 10 v4(df)
%timeit v5(df)

Output:

# v1: temp.values.tolist()
235 ms ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# v2: zip(temp.list_values, temp.start, temp.stop)
249 ms ± 9.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# v3: temp.values
578 ms ± 6.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# v4: zip(df.list_values.tolist(), df.start.tolist(), df.stop.tolist())
149 ms ± 8.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# v5: apply
12.1 s ± 165 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But yes, the list comprehension method, no matter what variation, is significantly faster than using apply.

Third update:

I figured out how to sort of vectorize this problem, using groupby and transform. Still not quite as good as the best list comprehension in my testing, but pretty darn good.

def v6(df):
    temp = df.copy()
    temp['sliced'] = temp.groupby(['start','stop'])['list_values'].transform(lambda x: x.str[x.name[0]:x.name[1]])

# v6: groupby
256 ms ± 5.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Thanks for help! This works but I think this is not vectorized. Pandas apply is just another loop. For many rows this would be rather slow I think — Ivan Stimac, Jun 18 '22 at 20:35
@IvanStimac you are right, but unfortunately `pandas` currently doesn't offer a way of vectorizing slice operations. If you are working with a big dataset, instead of `DataFrame.apply` you can try to use a list comprehension with the same logic. It can be a good improvement due to the overhead that `DataFrame.apply` introduces — Rodalm, Jun 18 '22 at 20:40
@Rodalm thanks for the tip. Actually I think df.str provide set of vectorized string operations and this works on lists also. For me the problem is that start and stop indexes are not fixed — Ivan Stimac, Jun 18 '22 at 20:45
@IvanStimac you should know that the `Series.str` methods [are not truly vectorized](https://stackoverflow.com/a/50744448/17120692). It's just a loop under the hood too, they are just convenient. Either way, [`Series.str.slice`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.slice.html) doesn't accept `Series` as arguments, just integers — Rodalm, Jun 18 '22 at 20:48
@Rodlam I didn't know this. Thanks for the input I will research this more. — Ivan Stimac, Jun 18 '22 at 20:59
@IvanStimac no worries, I'm glad I could help :) If you find something interesting about this topic, please let me know! Cheers — Rodalm, Jun 18 '22 at 21:00

Rodalm · Accepted Answer · 2022-06-18T23:24:25.743

2

You can try a list comprehension. It should be faster than using DataFrame.apply for big datasets, since DataFrame.apply introduces a big overhead.

df['sliced'] = [lst[start:stop] for lst, start, stop in zip(df.list_values, df.start, df.stop)]

Output:

>>> df

                list_values  start  stop  sliced
0              [5, 7, 6, 8]      0     2  [5, 7]
1  [1, 3, 5, 7, 2, 4, 6, 8]      1     3  [3, 5]
2  [1, 3, 5, 7, 2, 4, 6, 8]      0     2  [1, 3]
3  [1, 3, 5, 7, 2, 4, 6, 8]      0     2  [1, 3]
4  [1, 3, 5, 7, 2, 4, 6, 8]      1     3  [3, 5]
5  [1, 3, 5, 7, 2, 4, 6, 8]      2     4  [5, 7]
6  [1, 3, 5, 7, 2, 4, 6, 8]      0     2  [1, 3]

Look at BeRT2me benchmark for possible further improvements.

edited Jun 18 '22 at 23:24

answered Jun 18 '22 at 20:50

Rodalm

5,169
5
21

1

I was curious, so I ran some tests and found a variation of yours that is slightly faster~ – BeRT2me Jun 18 '22 at 22:50
1

@BeRT2me Well done, thank you for doing the benchmark! :) I think using`df.values.tolist()` is faster than using `zip(df.list_values, df.start, df.stop)` because iterating over Series is slower than iterating over lists – Rodalm Jun 18 '22 at 23:20
2

@BeRT2me is indeed very odd that `zip(*df.to_dict('list').values())` is even faster... I can't explain it either – Rodalm Jun 18 '22 at 23:25

dataframe.str[start:stop] where start and stop are columns in same data frame

2 Answers2