
I have a temporal KDE kernel as a list (or numpy array) of values, where each index corresponds to a minute of the week.

My data looks approximately like this:
- kde: list or ndarray of float values, of length 7*24*60.
- df: DataFrame with ~50 columns of different types, including a timestamp column with integer values in the range 0 to 7*24*60-1. The DataFrame has ~2,000,000 records.

As a sample:

col1|col2|...|col49|timestamp
1   | 2  |...| 49  |  15
2   | 3  |...| 50  |  16

My desired output is the very same dataframe with an added kd column holding the corresponding values from kde. In other words, for each record in the dataframe, I need to look up the KDE value using the record's timestamp as the index, and I need to do it as fast as possible.

Desired outcome:

col1|col2|...|col49|timestamp | kd
1   | 2  |...| 49  |  15      | 0.342
2   | 3  |...| 50  |  16      | 0.543

For now, I use .apply():

df['kd'] = df.timestamp.apply(lambda z: kde[z])

However, it is relatively slow, since (as far as I understand) it is still subject to the GIL limitation. Is there any way to vectorise this very simple function?

Philipp_Kats
  • Are you after: `df = pd.DataFrame({'kd': kde})`? Or if `df` already exists: `df['kd'] = kde`... – MaxU - stand with Ukraine Jun 28 '16 at 16:54
  • MaxU, my dataframe contains a few million records. For each one, I need to get a value from kde, which contains 7*24*60 records. I don't think this approach will work. Also, the result needs to depend on the timestamp value – Philipp_Kats Jun 28 '16 at 17:10
  • please provide a sample _input_ and _desired_ data sets, so we could understand how to help you... [how-to-make-good-reproducible-pandas-examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – MaxU - stand with Ukraine Jun 28 '16 at 17:12
  • `df.timestamp.apply` is passing time stamps to the `lambda` function. That means `kde[z]` is being used like a dictionary. This is inconsistent with what you've stated is happening. This could be part of the problem but we wouldn't know because you haven't provided sample data and a working example. – piRSquared Jun 28 '16 at 17:20
  • updated my question with sample data and desired outcome. hope that would be helpful – Philipp_Kats Jun 28 '16 at 17:40

2 Answers


I'd do

import numpy as np
import pandas as pd

# Convert kde to an ndarray and index it with the integer timestamps in one step
df['kd'] = np.array(kde)[df.timestamp.values]
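
A minimal self-contained sketch of the same lookup, assuming a randomly generated `kde` and a small stand-in `df` (both made up here for illustration, not the asker's data). Indexing a NumPy array with an integer array returns one element per index, so the whole column is filled in a single vectorized operation:

```python
import numpy as np
import pandas as pd

MINUTES_PER_WEEK = 7 * 24 * 60

# Hypothetical stand-in data: random KDE values and a tiny frame.
rng = np.random.default_rng(0)
kde = rng.random(MINUTES_PER_WEEK).tolist()  # could equally be an ndarray
df = pd.DataFrame({"col1": [1, 2, 3], "timestamp": [15, 16, 10079]})

# Convert once to an ndarray, then index with the integer timestamps.
kde_arr = np.asarray(kde)
df["kd"] = kde_arr[df["timestamp"].to_numpy()]

# Sanity check: same values as the element-wise .apply() from the question.
expected = df["timestamp"].apply(lambda z: kde[z])
assert np.allclose(df["kd"], expected)
```

The lookup requires the timestamps to be integers inside `0..len(kde)-1`; out-of-range values raise an `IndexError`, and negative ones wrap around silently.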
piRSquared

Another solution I might use is:

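import pandas as pd

# Build a lookup table with the minute index as an explicit column
kdeDF = pd.DataFrame({'kd': kde}).reset_index()
kdeDF.columns = ['index', 'kd']
# Left-join on timestamp so every record in df keeps its row
data1 = df.merge(kdeDF, how='left', left_on='timestamp', right_on='index')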

but it also looks pretty ugly
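
For comparison, a small sketch with made-up data (the `kde` values and the two-row frame are assumptions for illustration) that checks the merge produces the same values as direct array indexing. The merge also leaves a helper `index` column behind, which is dropped here:

```python
import numpy as np
import pandas as pd

# Hypothetical KDE values, one per minute of the week.
kde = np.linspace(0.0, 1.0, 7 * 24 * 60)
df = pd.DataFrame({"col1": [1, 2], "timestamp": [15, 16]})

# reset_index() on the default RangeIndex yields the columns: index, kd
kde_df = pd.DataFrame({"kd": kde}).reset_index()

# A left merge keeps every row of df, in its original order.
merged = df.merge(kde_df, how="left", left_on="timestamp", right_on="index")
merged = merged.drop(columns="index")

# Same values as indexing the array directly with the timestamps.
assert np.allclose(merged["kd"], kde[df["timestamp"].to_numpy()])
```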

Philipp_Kats