1

I'm new to python and very new to Pandas. I've looked through the Pandas documentation and tried multiple ways to solve this problem unsuccessfully.

I have a DateFrame with timestamps in one column and prices in another, such as:

d = {'TimeStamp': [1603822620000, 1603822680000,1603822740000, 1603823040000,1603823100000,1603823160000,1603823220000], 'Price': [101,105,102,108,105,101,106], 'OtherData1': [1,2,3,4,5,6,7], 'OtherData2': [7,6,5,4,3,2,1]}
df= pd.DataFrame(d)
df
       TimeStamp  Price  OtherData1  OtherData2
0  1603822620000    101           1           7
1  1603822680000    105           2           6
2  1603822740000    102           3           5
3  1603823040000    108           4           4
4  1603823100000    105           5           3
5  1603823160000    101           6           2
6  1603823220000    106           7           1

In addition to the two columns of interest, this DataFrame also has additional columns with data not particularly relevant to the question (represented with OtherData Cols).

I want to create a new column 'Fut2Min' (Price Two Minutes into the Future). There may be missing data, so this problem can't be solved by simply getting the data from 2 rows below.

I'm trying to find a way to make the value for Fut2Min Col in each row == the Price at the row with the timestamp + 120000 (2 minutes into the future) or null (or NAN or w/e) if the corresponding timestamp doesn't exist.

For the example data, the DF should be updated to: (Code used to mimic desired result)


d = {'TimeStamp': [1603822620000, 1603822680000, 1603822740000, 1603822800000, 1603823040000,1603823100000,1603823160000,1603823220000], 
     'Price': [101,105,102,108,105,101,106,111], 
     'OtherData1': [1,2,3,4,5,6,7,8], 
     'OtherData2': [8,7,6,5,4,3,2,1],
    'Fut2Min':[102,108,'NaN','NaN',106,111,'NaN','NaN']}
df= pd.DataFrame(d)
df
       TimeStamp  Price  OtherData1  OtherData2 Fut2Min
0  1603822620000    101           1           8     102
1  1603822680000    105           2           7     108
2  1603822740000    102           3           6     NaN
3  1603822800000    108           4           5     NaN
4  1603823040000    105           5           4     106
5  1603823100000    101           6           3     111
6  1603823160000    106           7           2     NaN
7  1603823220000    111           8           1     NaN
MrMagooMan
  • 161
  • 2
  • 5

2 Answers2

1

EDIT 2: I have updated the solution since it had some sloppy parts (exchanged the list for index determination with a dictionary and restricted the search for timestamps).

This (with import numpy as np)

indices = {ts - 120000: i for i, ts in enumerate(df['TimeStamp'])}
df['Fut2Min'] = [
    np.nan
    if (ts + 120000) not in df['TimeStamp'].values[i:] else
    df['Price'].iloc[indices[ts]]
    for i, ts in enumerate(df['TimeStamp'])
]

gives you

       TimeStamp  Price  Fut2Min
0  1603822620000    101    102.0
1  1603822680000    105    108.0
2  1603822740000    102      NaN
3  1603822800000    108      NaN
4  1603823040000    105    106.0
5  1603823100000    101    111.0
6  1603823160000    106      NaN
7  1603823220000    111      NaN

But I'm not sure if that is an optimal solution.


EDIT: Inspired by the discussion in the comments I did some timing:

With the sample frame

from itertools import accumulate
import numpy as np

rng = np.random.default_rng()
n = 10000
timestamps = [1603822620000 + t
              for t in accumulate(rng.integers(1, 4) * 60000 
                                  for _ in range(n))]
df = pd.DataFrame({'TimeStamp': timestamps, 'Price': n * [100]})
          TimeStamp  Price
0     1603822680000    100
...             ...    ...
9999  1605030840000    100

[10000 rows x 2 columns]

and the two test functions

# (1) Other solution
def Fut2MinFunc(row):
    futTimeStamp = row.TimeStamp + 120000
    if (futTimeStamp in df.TimeStamp.values):
        return df.loc[df['TimeStamp'] == futTimeStamp, 'Price'].iloc[0]
    else:
        return None

def test_1():
    df['Fut2Min'] = df.apply(Fut2MinFunc, axis = 1)

# (2) Solution here
def test_2():
    indices = list(df['TimeStamp'] - 120000)
    df['Fut2Min'] = [
        np.nan
        if (timestamp + 120000) not in df['TimeStamp'].values else
        df['Price'].iloc[indices.index(timestamp)]
        for timestamp in df['TimeStamp']
    ]

I conducted the experiment

from timeit import timeit

t1 = timeit('test_1()', number=100, globals=globals())
t2 = timeit('test_2()', number=100, globals=globals())
print(t1, t2)

with the result

135.98962861 40.306039344

which seems to imply that the version here is faster? (I also measured directly with time() and without the wrapping in functions and the results are virtually identical.)

With my updated version the result looks like

139.33713767799998 14.178187169000012

I finally did one try with a frame with 1,000,000 rows (number=1) and the result was

763.737430931 175.73120002400003
Timus
  • 10,974
  • 5
  • 14
  • 28
  • Thanks a lot for the response. I think that would solve my problem, but ozturkib's solution seems a bit better (at least as far as I can tell as someone new to python & pandas). Regardless, very much appreciated. – MrMagooMan Oct 29 '20 at 15:11
  • @MrMagooMan Thanks for the feedback, it's _your_ choice ;) I always try to follow the second answer [here](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas) which favours list comprehension over `apply()`. But that isn't ment to say that his solution is bad. – Timus Oct 29 '20 at 15:18
  • Thanks for the link. I'll look into that – MrMagooMan Oct 29 '20 at 15:20
  • @MrMagooMan I have compared two methods (my one vs this one) in terms of the performance as well. The one I have placed there is almost two times faster than the one offered here. You can check it on your side as well. Elapsed times are 0.00413 seconds for my one and 0.00775 seconds for this solution. – ozturkib Oct 29 '20 at 16:16
  • @ozturkib Interesting! Since I wasn't convinced I also did an experiment (see the edit of my answer). The result seems to contradict yours. But maybe I made a mistake. I'd be happy if you could check it. – Timus Oct 29 '20 at 18:51
1

Assuming that the DataFrame is:

       TimeStamp  Price  OtherData1  OtherData2  Fut2Min
0  1603822620000    101           1           8        0
1  1603822680000    105           2           7        0
2  1603822740000    102           3           6        0
3  1603822800000    108           4           5        0
4  1603823040000    105           5           4        0
5  1603823100000    101           6           3        0
6  1603823160000    106           7           2        0
7  1603823220000    111           8           1        0

Then, if you use pandas.DataFrame.apply, along the column axis:

import pandas as pd

def Fut2MinFunc(row):
    futTimeStamp = row.TimeStamp + 120000
    if (futTimeStamp in df.TimeStamp.values):
        return df.loc[df['TimeStamp'] == futTimeStamp, 'Price'].iloc[0]
    else:
        return None

df['Fut2Min'] = df.apply(Fut2MinFunc, axis = 1)

You will get exactly what you describe as:

       TimeStamp  Price  OtherData1  OtherData2  Fut2Min
0  1603822620000    101           1           8    102.0
1  1603822680000    105           2           7    108.0
2  1603822740000    102           3           6      NaN
3  1603822800000    108           4           5      NaN
4  1603823040000    105           5           4    106.0
5  1603823100000    101           6           3    111.0
6  1603823160000    106           7           2      NaN
7  1603823220000    111           8           1      NaN
ozturkib
  • 1,493
  • 16
  • 28
  • @MrMagooMan Thanks for the feedback. I try to make it efficient avoiding from iterations with taking into account the overall level of knowledge on pandas as well. – ozturkib Oct 29 '20 at 15:40