I have the below data which I store in a csv (df_sample.csv). I have the column names in a list called cols_list.
df_data_sample:
df_data_sample = pd.DataFrame({
'new_video':['BASE','SHIVER','PREFER','BASE+','BASE+','EVAL','EVAL','PREFER','ECON','EVAL'],
'ord_m1':[0,1,1,0,0,0,1,0,1,0],
'rev_m1':[0,0,25.26,0,0,9.91,'NA',0,0,0],
'equip_m1':[0,0,0,'NA',24.9,20,76.71,57.21,0,12.86],
'oev_m1':[3.75,8.81,9.95,9.8,0,0,'NA',10,56.79,30],
'irev_m1':['NA',19.95,0,0,4.95,0,0,29.95,'NA',13.95]
})
attribute_dict = {
'new_video': 'CAT',
'ord_m1':'NUM',
'rev_m1':'NUM',
'equip_m1':'NUM',
'oev_m1':'NUM',
'irev_m1':'NUM'
}
Then I read each column and do some data processing as below:
cols_list = df_data_sample.columns
# Write to csv.
df_data_sample.to_csv("df_seg_sample.csv",index = False)
#df_data_sample = pd.read_csv("df_seg_sample.csv")
#Create empty dataframe to hold final processed data for each income level.
df_final = pd.DataFrame()
# Read in each column, process, and write to a csv - using csv module
for column in cols_list:
df_column = pd.read_csv('df_seg_sample.csv', usecols = [column],delimiter = ',')
if (((attribute_dict[column] == 'CAT') & (df_column[column].unique().size <= 100))==True):
df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True,prefix=column)
# Check and remove duplicate columns if any:
df_target_attribute = df_target_attribute.loc[:,~df_target_attribute.columns.duplicated()]
for target_column in list(df_target_attribute.columns):
# If variance of the dummy created is zero : append it to a list and print to log file.
if ((np.var(df_target_attribute[[target_column]])[0] != 0)==True):
df_final[target_column] = df_target_attribute[[target_column]]
elif (attribute_dict[column] == 'NUM'):
#Let's impute with 0 for numeric variables:
df_target_attribute = df_column
df_target_attribute.fillna(value=0,inplace=True)
df_final[column] = df_target_attribute
attribute_dict is a dictionary containing the mapping of variable name : variable type as :
{
'new_video': 'CAT'
'ord_m1':'NUM'
'rev_m1':'NUM'
'equip_m1':'NUM'
'oev_m1':'NUM'
'irev_m1':'NUM'
}
However, this column by column operation takes long time to run on a dataset of size**(5million rows * 3400 columns)**. Currently the run time is approximately 12+ hours. I want to reduce this as much as possible and one of the ways I can think of is to do processing for all NUM columns at once and then go column by column for the CAT variables. However I am neither sure of the code in Python to achieve this nor if this will really fasten up the process. Can someone kindly help me out!