How to set ranges of rows in pandas?

Question

I have the following working code, which sets "new_col" to 1 in the rows covered by the intervals given by starts and ends (both bounds inclusive).

import pandas as pd
import numpy as np


df = pd.DataFrame({"a": np.arange(10)})

starts = [1, 5, 8]
ends = [1, 6, 10]

value = 1
df["new_col"] = 0

for s, e in zip(starts, ends):
    df.loc[s:e, "new_col"] = value

print(df)

   a  new_col
0  0        0
1  1        1
2  2        0
3  3        0
4  4        0
5  5        1
6  6        1
7  7        0
8  8        1
9  9        1

I want these intervals to come from another dataframe pointer_df.

How can I vectorize this?

pointer_df = pd.DataFrame({"starts": starts, "ends": ends})

Attempt:

df.loc[pointer_df["starts"]:pointer_df["ends"], "new_col"] = 2
print(df)

This obviously doesn't work and raises:

    raise AssertionError("Start slice bound is non-scalar")
AssertionError: Start slice bound is non-scalar

EDIT:

It seems all answers use some kind of Python for loop.

The question was how to vectorize the operation above.

Is this not doable without for loops or list comprehensions?

Dani Mesejo · Accepted Answer · 2020-12-01T21:48:55.260

You could do:

pointer_df = pd.DataFrame({"starts": starts, "ends": ends})

rang = np.arange(len(df))
indices = [i for s, e in pointer_df.to_numpy() for i in rang[slice(s, e + 1, None)]]

df.loc[indices, 'new_col'] = value
print(df)

Output

   a  new_col
0  0        0
1  1        1
2  2        0
3  3        0
4  4        0
5  5        1
6  6        1
7  7        0
8  8        1
9  9        1

If you want a method that does not use any for loop or list comprehension and relies only on numpy, you could do:

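def indices(start, end, ma=10):
    # Exclusive stop of each interval, clipped to the number of rows `ma`.
    limits = np.minimum(end + 1, ma)
    # Running total of interval lengths: lens[k] is where interval k + 1
    # begins in the output (every interval must contain at least one row).
    lens = np.cumsum(limits - start)
    # Steps of 1 count upwards inside an interval once cumulatively summed.
    i = np.ones(lens[-1], dtype=int)
    i[0] = start[0]
    # At each boundary, jump from the previous stop to the next start.
    i[lens[:-1]] += start[1:]
    i[lens[:-1]] -= limits[:-1]
    np.cumsum(i, out=i)
    return i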


pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
df.loc[indices(pointer_df.starts.values, pointer_df.ends.values, ma=len(df)), "new_col"] = value
print(df)

I adapted the method to your use case from the one in this answer.
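
As a complement to the index-building function above, the same result can also be reached by building a boolean mask instead of a list of row positions. The following is a minimal sketch, not part of the original answer: the helper names `mask_broadcast` and `mask_cumsum` are made up for illustration, and both assume the default RangeIndex, so that row positions and row labels coincide.

```python
import numpy as np
import pandas as pd


def mask_broadcast(starts, ends, n):
    """True at every position that lies inside any inclusive [start, end]."""
    pos = np.arange(n)[:, None]  # shape (n, 1), broadcast against shape (m,)
    return ((pos >= starts) & (pos <= ends)).any(axis=1)


def mask_cumsum(starts, ends, n):
    """Same mask via a difference array: +1 at each start, -1 just past each end."""
    delta = np.zeros(n + 1, dtype=int)
    # np.add.at accumulates correctly even when positions repeat.
    np.add.at(delta, np.clip(starts, 0, n), 1)
    np.add.at(delta, np.clip(ends + 1, 0, n), -1)
    # A positive running sum means at least one interval is open at that row.
    return np.cumsum(delta[:-1]) > 0


df = pd.DataFrame({"a": np.arange(10)})
pointer_df = pd.DataFrame({"starts": [1, 5, 8], "ends": [1, 6, 10]})

starts = pointer_df["starts"].to_numpy()
ends = pointer_df["ends"].to_numpy()
value = 1

mask = mask_broadcast(starts, ends, len(df))
df["new_col"] = np.where(mask, value, 0)
print(df)
```

The broadcasting version builds an intermediate array of shape (rows, intervals), so it is convenient for small inputs but memory-hungry for large ones. The difference-array version uses memory proportional to rows plus intervals, and it tolerates overlapping intervals and ends past the last row.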

score 0 · Answer 2 · answered Dec 01 '20 at 17:43

for i, j in zip(pointer_df["starts"], pointer_df["ends"]):
    print(i, j)

Apply the same method, but to the columns of your DataFrame.

ombk

this is just a copy paste of what I did, just doesn't solve the problem – Gulzar Dec 01 '20 at 20:56
