What is the optimal way to assign a value to a pandas DataFrame column from a column in a different row?

Question

I need to iterate over a DataFrame indexed by UNIX timestamp, and in one column, assign a value from another column in a different row at a specific index time in the future. This is what I'm currently doing:

df = pd.DataFrame([
    [1523937600, 100.0, 0.0], 
    [1523937660, 120.0, 0.0], 
    [1523937720, 110.0, 0.0],
    [1523937780, 90.0, 0.0],
    [1523937840, 99.0, 0.0]], 
    columns=['time', 'value', 'target'])
df.set_index('time', inplace=True)

skip = 2  # mins skip-ahead
for i in range(0, df.shape[0]-1):       
    t = df.index[i] + (60*skip)
    try:
        df.iloc[i].target = df.loc[t].value
    except KeyError:
        df.iloc[i].target = 0.0

Output:

            value  target
time                     
1523937600  100.0   110.0
1523937660  120.0    90.0
1523937720  110.0    99.0
1523937780   90.0     0.0
1523937840   99.0     0.0

This works, but I am dealing with datasets containing millions of rows and it takes an extremely long time. Is there a more optimal way to do this?

EDIT: Added example input/output. Note, it is important that I obtain the value from the row with the calculated index time rather than just look ahead n rows, as there could be gaps in the times, or additional times in between.

Please provide a sample input and expected output to make a [mcve] so that we can better understand your issue. See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — G. Anderson, Feb 23 '21 at 21:48

Inputvector · Answer 1 · 2021-02-24T05:20:44.583

0

In this case you should keep time as a column as well as index. Hope this helps:

df = pd.DataFrame([ 
    [1523937600, 100.0, 0.0], 
    [1523937660, 120.0, 0.0], 
    [1523937720, 110.0, 0.0],
    [1523937780, 90.0, 0.0],
    [1523937840, 99.0, 0.0]], 
    columns=['time', 'value', 'target'])
df.index = df['time']

df['target'] = df['time'].apply(lambda x: df.loc[x+(skip*60)].value if x+(skip*60) in df.index.values  else 0.0)

edited Feb 24 '21 at 05:20

answered Feb 24 '21 at 00:36

Inputvector

1,061
10
22

On a sample of 100k rows this performs 5x slower for me (~60s vs ~12s). I'm new to pandas, but looking at the code, I wouldn't expect this to be faster - for each row you are doing a value search twice in your slice to find the index (2 calls to the idx function), whereas in the iteration code it's a direct lookup. – amnesia Feb 24 '21 at 02:19
Previous code was much faster for me. I think rows number affect it negative. I updated the code but you need to keep time as a column as well as index in this case. Let me know about the performance if you try. – Inputvector Feb 24 '21 at 05:23
Got pulled to another project and now revisiting this. Your update does improve over your previous code and brings a 100k sample down to ~15s, so quite a savings but still not faster than the iteration code I have. I appreciate the effort though. – amnesia Mar 12 '21 at 00:34

What is the optimal way to assign a value to a pandas DataFrame column from a column in a different row?

1 Answers1