
I have two dataframes. The first (df1) has 22 columns and holds the values for all variables. The second (good_models) holds the columns of interest that I want to extract from df1. For every row in good_models, I need to take the column references for var1-var10 and target, keep only those columns from df1, rename them to var1-var10 and target, and append the result to a new dataframe called model_data_long.

I can perform this task with a for loop, but it is very slow, and I am hoping there is a more efficient way.

I could potentially adapt this related Stack Overflow answer, but I am unsure how to apply it here.

import numpy as np
import pandas as pd
df1=pd.DataFrame(np.random.randint(0,100,size=(100,20+1)),columns=list(range(0,20+1)))
df1['target']=np.random.randint(2,size=100)
### These will be the column names in model_data_long
labels=['var1','var2','var3','var4','var5','var6','var7','var8','var9','var10','target']
### Contains the columns I want to extract from df1 and append to model_data_long
good_models=pd.DataFrame.from_records([(0,1,2,3,4,5,6,7,8,9,'target'),
                                     (9,8,7,6,5,4,3,2,1,0,'target'),
                                     (20,19,18,17,16,15,14,13,12,11,'target')],columns=labels)
### works but is slow
model_data_long=pd.DataFrame()
for i in range(0,len(good_models)):
    ### Extracting the values for a record from good_models
    t_list=good_models[good_models.index==i].values.tolist()[0]
    ### Keeping only the columns from t_list from the df1 frame.
    temp_data=pd.DataFrame(data=df1.filter(items=t_list,axis=1))
    ### renaming the columns in temp_data
    temp_data.columns=labels
    ### It is imperative that I have an index variable in the model_data_long dataframe.
    ### Setting the model_index variable, critical.
    temp_data['model_index']=i
    ### Finally, append to a long running dataframe.
    model_data_long=model_data_long.append([temp_data],ignore_index=True)
Kyle

2 Answers


You can use a very fast NumPy solution, thanks to Divakar for the original answer:

#convert df1 to numpy array
a = df1.values
#convert first 10 columns to numpy array
b = good_models.iloc[:, :10].values
#reshape in numpy and label with all column names except the last
df = pd.DataFrame(a[:, b].swapaxes(0,1).reshape(-1,b.shape[1]), columns=labels[:-1])

#add the remaining columns by repeating values with tile and repeat
df['target'] = np.tile(df1['target'].values, len(good_models))
df['model_index'] = np.repeat(good_models.index, len(df1))
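
For intuition on the indexing step: a[:, b] uses NumPy fancy indexing to pull, for every row of df1, the column set listed in each row of b, giving a 3-D array of shape (len(df1), len(good_models), 10); swapaxes and reshape then stack the per-model blocks vertically. A minimal standalone sketch with toy shapes (the names a, b and stacked here are illustrative, not part of the answer):

import numpy as np

#3 rows, 4 columns, standing in for df1.values
a = np.arange(12).reshape(3, 4)
#2 "models", each selecting 2 columns of interest
b = np.array([[0, 1], [3, 2]])
#shape (3, 2, 2): rows x models x selected columns
out = a[:, b]
#shape (6, 2): the per-model blocks stacked vertically;
#rows 0-2 hold columns [0, 1] of a, rows 3-5 hold columns [3, 2]
stacked = out.swapaxes(0, 1).reshape(-1, b.shape[1])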

Timings:

In [251]: %timeit (jez())
100 loops, best of 3: 2.47 ms per loop

In [252]: %timeit (orig())
1 loop, best of 3: 703 ms per loop

Setup:

import numpy as np
import pandas as pd
np.random.seed(452)
df1=pd.DataFrame(np.random.randint(0,100,size=(100,20+1)),columns=list(range(0,20+1)))
df1['target']=np.random.randint(2,size=100)
### These will be the column names in model_data_long
labels=['var1','var2','var3','var4','var5','var6','var7','var8','var9','var10','target']
### Contains the columns I want to extract from df1 and append to model_data_long
good_models=pd.DataFrame.from_records([(0,1,2,3,4,5,6,7,8,9,'target'),
                                     (9,8,7,6,5,4,3,2,1,0,'target'),
                                     (20,19,18,17,16,15,14,13,12,11,'target')],columns=labels)
### works but is slow

#repeat the rows 100 times, giving 300 rows
good_models = pd.concat([good_models]*100).reset_index(drop=True)

Functions:

def orig():

    model_data_long=pd.DataFrame()
    for i in range(0,len(good_models)):
        ### Extracting the values for a record from good_models
        t_list=good_models[good_models.index==i].values.tolist()[0]
        ### Keeping only the columns from t_list from the df1 frame.
        temp_data=pd.DataFrame(data=df1.filter(items=t_list,axis=1))
        ### renaming the columns in temp_data
        temp_data.columns=labels
        ### It is imperative that I have an index variable in the model_data_long dataframe.
        ### Setting the model_index variable, critical.
        temp_data['model_index']=i
        ### Finally, append to a long running dataframe.
        model_data_long=model_data_long.append([temp_data],ignore_index=True)

    return model_data_long


def jez():
    #convert df1 to numpy array
    a = df1.values
    #convert first 10 columns to numpy array
    b = good_models.iloc[:, :10].values
    #reshape in numpy and label with all column names except the last
    df = pd.DataFrame(a[:, b].swapaxes(0,1).reshape(-1,b.shape[1]), columns=labels[:-1])

    #add the remaining columns by repeating values with tile and repeat
    df['target'] = np.tile(df1['target'].values, len(good_models))
    df['model_index'] = np.repeat(good_models.index, len(df1))
    return df
jezrael
jezrael, first of all, this is lightning fast compared to what I was doing. I don't fully understand everything going on here, but I can tell it is accurate (with some modifications to your original solution). With an additional function added to your jez function, I applied a groupby on model_index, so I can group by model index, apply the logit, and hit it in one pass instead of my usual for loop and subsetting. Results: 2 minutes for 10,000 models, versus around 10-15 minutes with my usual approach. – Kyle Jan 17 '18 at 10:35
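
For reference, a minimal sketch of the one-pass fit Kyle describes: group the long frame by model_index and fit one logistic regression per group. The use of scikit-learn and the fit_logit helper are illustrative assumptions, not Kyle's actual code:

from sklearn.linear_model import LogisticRegression

def fit_logit(group):
    #hypothetical helper: var1-var10 are the features, target is the label
    X = group[['var%d' % i for i in range(1, 11)]].astype(float)
    y = group['target'].astype(int)
    return LogisticRegression().fit(X, y)

#one pass over all models instead of a loop with repeated subsetting;
#model_data_long is assumed to be the output of jez()
fits = model_data_long.groupby('model_index').apply(fit_logit)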

I figured out two ways of doing it without reaching for the multiprocessing package. This one uses a simple pandas trick; see whether it works for you. I will share the other one later.

pd.concat([pd.DataFrame(df1[i].values, columns=labels) for i in good_models.values.tolist()],ignore_index=True)
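
Note that this one-liner omits the model_index column the question calls critical. A minimal variant that adds it (the enumerate/assign step is my addition, not part of the original answer):

model_data_long = pd.concat(
    [pd.DataFrame(df1[cols].values, columns=labels).assign(model_index=i)
     for i, cols in enumerate(good_models.values.tolist())],
    ignore_index=True)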
furianpandit