vectorised list.get() solution in numpy/pandas

Question

I have a temporal KDE kernel as a list (or NumPy array) of values, where each index corresponds to a minute of the week.

My data is approximately as follows:

- kde: list or ndarray of float values, of length 7*24*60.
- df: DataFrame with ~50 columns of different types, including a timestamp column with integer values in the range 0 to 7*24*60-1. The DataFrame has ~2,000,000 records.

A sample:

col1|col2|...|col49|timestamp
1   | 2  |...| 49  |  15
2   | 3  |...| 50  |  16

My desired output is the very same DataFrame with an added kd column holding the corresponding values from kde. In other words, for each record in the DataFrame, I need to look up the KDE value using the record's timestamp. I need to do it as fast as possible.

Desired outcome:

col1|col2|...|col49|timestamp | kd
1   | 2  |...| 49  |  15      | 0.342
2   | 3  |...| 50  |  16      | 0.543

For now, I use .apply():

df['kd'] = df.timestamp.apply(lambda z: kde[z])

However, this is relatively slow; as far as I understand, it is still subject to the GIL limitation. Is there any way to vectorise this very simple function?

Are you after: `df = pd.DataFrame({'kd': kde})`? Or if `df` already exists: `df['kd'] = kde`... — MaxU - stand with Ukraine, Jun 28 '16 at 16:54
MaxU, my dataframe contains a few millions of records. for each one, I need to get a value from kde, which contains 7*24*60 records. I don't think this approach will go. also, result need to depend on the timestamp value — Philipp_Kats, Jun 28 '16 at 17:10
please provide a sample _input_ and _desired_ data sets, so we could understand how to help you... [how-to-make-good-reproducible-pandas-examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — MaxU - stand with Ukraine, Jun 28 '16 at 17:12
`df.timestamp.apply` is passing time stamps to the `lambda` function. That means `kde[z]` is being used like a dictionary. This is inconsistent with what you've stated is happening. This could be part of the problem but we wouldn't know because you haven't provided sample data and a working example. — piRSquared, Jun 28 '16 at 17:20
updated my question with sample data and desired outcome. hope that would be helpful — Philipp_Kats, Jun 28 '16 at 17:40

piRSquared · Accepted Answer · 2016-06-28T18:06:39.787

1

I'd do

import numpy as np
import pandas as pd

# Index the kernel array with the whole timestamp column at once.
df['kd'] = np.array(kde)[df.timestamp.values]
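For anyone wanting to try this without the real data, here is a minimal self-contained sketch of the same integer-array indexing. The random `kde` values and the three-row `df` are made-up stand-ins, not the asker's data; only the lookup line mirrors the approach above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: one kernel value per minute of the week.
minutes_per_week = 7 * 24 * 60
rng = np.random.default_rng(0)
kde = rng.random(minutes_per_week).tolist()  # a plain list, as in the question

# Tiny made-up frame; the real one has ~50 columns and ~2,000,000 rows.
df = pd.DataFrame({"col1": [1, 2, 3], "timestamp": [15, 16, 15]})

# Convert once to an ndarray, then index it with the integer column.
# NumPy performs the whole lookup in one call instead of one Python call per row.
kde_arr = np.asarray(kde)
df["kd"] = kde_arr[df["timestamp"].to_numpy()]
```

This assumes every timestamp is a valid integer position in `kde`; an out-of-range value raises an `IndexError`, and negative values would silently index from the end of the array.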

edited Jun 28 '16 at 18:06

answered Jun 28 '16 at 17:56

piRSquared


score 0 · Answer 2 · answered Jun 28 '16 at 17:48


another solution I might use is:

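# Turn the kernel into a lookup table keyed by minute-of-week index.
kdeDF = pd.DataFrame({'kd': kde}).reset_index()
kdeDF.columns = ['index', 'kd']
# Left-join the lookup table onto the data by timestamp.
data1 = df.merge(kdeDF, how='left', left_on='timestamp', right_on='index')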

but it also looks pretty ugly
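As a rough sanity check on made-up data, the merge-based lookup can be compared against plain NumPy integer indexing. The `kde` values and the small `df` below are hypothetical stand-ins chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical kernel: evenly spaced values, one per minute of the week.
kde = np.linspace(0.0, 1.0, 7 * 24 * 60)

# Made-up frame including the first and last valid timestamps.
df = pd.DataFrame({"col1": [1, 2, 3, 4], "timestamp": [15, 16, 10079, 0]})

# Merge-based lookup: build a (index, kd) table and left-join on timestamp.
kde_df = pd.DataFrame({"index": np.arange(len(kde)), "kd": kde})
merged = df.merge(
    kde_df, how="left", left_on="timestamp", right_on="index"
).drop(columns="index")

# Direct lookup: index the array with the timestamp column.
indexed = df.assign(kd=kde[df["timestamp"].to_numpy()])

# Both approaches should produce the same kd column, in the same row order.
same = np.allclose(merged["kd"].to_numpy(), indexed["kd"].to_numpy())
```

The merge has to build an extra DataFrame and leaves a helper `index` column to drop, which is probably why it feels clumsier than indexing the array directly.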

answered Jun 28 '16 at 17:48

Philipp_Kats

