
I have the data below, which I store in a csv (df_seg_sample.csv). I have the column names in a list called cols_list.

df_data_sample:

import numpy as np
import pandas as pd

df_data_sample = pd.DataFrame({
                    'new_video':['BASE','SHIVER','PREFER','BASE+','BASE+','EVAL','EVAL','PREFER','ECON','EVAL'],
                    'ord_m1':[0,1,1,0,0,0,1,0,1,0],
                    'rev_m1':[0,0,25.26,0,0,9.91,'NA',0,0,0],
                    'equip_m1':[0,0,0,'NA',24.9,20,76.71,57.21,0,12.86],
                    'oev_m1':[3.75,8.81,9.95,9.8,0,0,'NA',10,56.79,30],
                    'irev_m1':['NA',19.95,0,0,4.95,0,0,29.95,'NA',13.95]
                    })

attribute_dict = {
        'new_video': 'CAT',
        'ord_m1':'NUM',
        'rev_m1':'NUM',
        'equip_m1':'NUM',
        'oev_m1':'NUM',
        'irev_m1':'NUM'
        }

Then I read each column and do some data processing as below:

cols_list = df_data_sample.columns
# Write to csv.
df_data_sample.to_csv("df_seg_sample.csv",index = False)
#df_data_sample = pd.read_csv("df_seg_sample.csv")
#Create empty dataframe to hold the final processed data.
df_final = pd.DataFrame()
# Read in each column one at a time, process it, and add it to df_final.
for column in cols_list:
    df_column = pd.read_csv('df_seg_sample.csv', usecols=[column], delimiter=',')
    if attribute_dict[column] == 'CAT' and df_column[column].unique().size <= 100:
        df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True,prefix=column)
        # Check and remove duplicate columns if any:
        df_target_attribute = df_target_attribute.loc[:,~df_target_attribute.columns.duplicated()]

        for target_column in list(df_target_attribute.columns):
            # Keep the dummy only if its variance is non-zero; zero-variance
            # dummies are meant to be appended to a list and printed to a log file.
            if np.var(df_target_attribute[target_column]) != 0:
                df_final[target_column] = df_target_attribute[target_column]


    elif (attribute_dict[column] == 'NUM'):
        #Let's impute with 0 for numeric variables:
        df_target_attribute = df_column
        df_target_attribute.fillna(value=0,inplace=True)
        df_final[column] = df_target_attribute

attribute_dict is a dictionary containing the mapping of variable name : variable type, as:

{
'new_video': 'CAT',
'ord_m1': 'NUM',
'rev_m1': 'NUM',
'equip_m1': 'NUM',
'oev_m1': 'NUM',
'irev_m1': 'NUM'
}

However, this column-by-column operation takes a long time to run on a dataset of size **(5 million rows * 3400 columns)**. Currently the run time is approximately 12+ hours. I want to reduce this as much as possible, and one of the ways I can think of is to process all NUM columns at once and then go column by column for the CAT variables. However, I am neither sure of the Python code to achieve this, nor whether it will really speed up the process. Can someone kindly help me out!

Shuvayan Das

2 Answers


There are three things I would advise to speed up your computations:

  1. Take a look at pandas' HDF5 capabilities. HDF5 is a binary file format for fast reading and writing of data to disk (see the sketch after this list).
  2. I would read in bigger chunks (several columns) of your csv file at once, depending on how much memory you have (also sketched after this list).
  3. There are many pandas operations you can apply to every column at once. For example nunique() (giving you the number of unique values, so you don't need unique().size). With these column-wise operations you can easily filter columns by selecting with a binary vector. E.g.

df = df.loc[:, df.nunique() > 100]
# drop every column with 100 or fewer unique values
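
A minimal sketch of the HDF5 route, assuming the 'tables' package is installed and using a hypothetical file name df_seg_sample.h5:

# One-time conversion: write the data to HDF5 (format='table' allows column selection on read).
df_data_sample.to_hdf('df_seg_sample.h5', key='data', mode='w', format='table')
# Subsequent reads skip csv parsing entirely and can pull just the columns you need.
df_num = pd.read_hdf('df_seg_sample.h5', key='data', columns=['ord_m1', 'rev_m1'])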
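
And a sketch of reading the csv in chunks of columns instead of one column at a time; the chunk size of 100 columns is an arbitrary assumption you would tune to your memory:

chunk_size = 100  # columns per read (assumption)
for i in range(0, len(cols_list), chunk_size):
    col_chunk = list(cols_list[i:i + chunk_size])
    df_chunk = pd.read_csv('df_seg_sample.csv', usecols=col_chunk)
    # ...process every column of df_chunk here...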

Also, this answer from the author of pandas on large data workflows might be interesting for you.

P.Tillmann
  • Thanks a lot @P.Tillmann. I will definitely look at the link you have shared in more detail. Seems very informative at first glance. :) – Shuvayan Das May 30 '18 at 16:15

For numeric columns it is simple:

num_cols = [k for k, v in attribute_dict.items() if v == 'NUM']
print (num_cols)
['ord_m1', 'rev_m1', 'equip_m1', 'oev_m1', 'irev_m1']

df1 = pd.read_csv('df_seg_sample.csv', usecols=num_cols).fillna(0)

But the first part of the code is the performance problem, especially get_dummies called for 5 million rows:

df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)

Unfortunately, it is problematic to process get_dummies in chunks.
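
For the CAT columns, a hedged sketch of the asker's own idea (one get_dummies pass over all categorical columns together rather than chunking, assuming those columns fit in memory at once):

cat_cols = [k for k, v in attribute_dict.items() if v == 'CAT']
df_cat = pd.read_csv('df_seg_sample.csv', usecols=cat_cols)
# get_dummies on a whole DataFrame prefixes every dummy with its source column name.
df_dummies = pd.get_dummies(df_cat, dummy_na=True)
# Drop zero-variance dummies with one column-wise operation instead of a Python loop.
df_dummies = df_dummies.loc[:, df_dummies.var() != 0]
df_final = pd.concat([df1, df_dummies], axis=1)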

jezrael