I have a large dataframe (2,000,000+ rows by 2,000+ columns) and I want to create some new columns based on the data contained in each row. One column holds a number which I want to use as an indexer to select the data in other columns. This indexer changes for each row, which means selecting the columns by name isn't possible, so I need to use numeric indexing. After identifying the columns I need, I then want to output the values to a new set of columns (after which I will perform some calculations).

Below is a simplified version of what I have tried:


# import packages

import pandas as pd
import numpy as np


# create dataframe

n = 10000
a_bar = 9; a_sd = 2
b_bar = 1000; b_sd = 100

np.random.seed(12345)
df = pd.DataFrame(dict(month_index=np.random.normal(a_bar, a_sd, size=n),
                       month_1=np.random.normal(b_bar, b_sd, size=n),
                       month_2=np.random.normal(b_bar, b_sd, size=n),
                       month_3=np.random.normal(b_bar, b_sd, size=n),
                       month_4=np.random.normal(b_bar, b_sd, size=n),
                       month_5=np.random.normal(b_bar, b_sd, size=n),
                       month_6=np.random.normal(b_bar, b_sd, size=n),
                       month_7=np.random.normal(b_bar, b_sd, size=n),
                       month_8=np.random.normal(b_bar, b_sd, size=n),
                       month_9=np.random.normal(b_bar, b_sd, size=n),
                       month_10=np.random.normal(b_bar, b_sd, size=n),
                       month_11=np.random.normal(b_bar, b_sd, size=n),
                       month_12=np.random.normal(b_bar, b_sd, size=n)
                       ),
                  columns=['month_index', 'month_1', 'month_2', 'month_3',
                           'month_4', 'month_5', 'month_6', 'month_7',
                           'month_8', 'month_9', 'month_10', 'month_11', 'month_12'])

# round all column values

df = df.round()

# restrict value of month index

conditions = [df.month_index < 7, df.month_index > 12, (df.month_index >= 7) & (df.month_index <= 12)]
values = [7, 12, df.month_index]

df["month_index"] = np.select(conditions, values)
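As an aside, the three-condition `np.select` above can also be written with `Series.clip`, which clamps values to a range in one call (a small sketch with made-up values):

```python
import pandas as pd

# clamp month_index into [7, 12]; same result as the three-condition
# np.select above (for non-NaN values)
month_index = pd.Series([5.0, 8.0, 14.0, 12.0])
clamped = month_index.clip(lower=7, upper=12)
```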

# reduce size of dataframe

for column in df.columns:
    df[column] = pd.to_numeric(df[column], downcast='integer')

# select relevant data using function

def select_columns(df):
    
    i = 0
         
    while i < len(df):

        j = df.at[i, 'month_index']
        df.at[i, "temp_1"] = df.iat[i, j-5]
        df.at[i, "temp_2"] = df.iat[i, j-4]
        df.at[i, "temp_3"] = df.iat[i, j-3]
        df.at[i, "temp_4"] = df.iat[i, j-2]
        df.at[i, "temp_5"] = df.iat[i, j-1]
        df.at[i, "temp_6"] = df.iat[i, j]
    
        i += 1
    
    return df

df = select_columns(df)

As I've found out, this is terribly inefficient, and I would much rather use pandas' .apply method or, if possible, vectorisation. Whilst I have tried out numpy's vectorisation functionality, I am aware that this approach is only feasible if columns can be passed to a function as numpy arrays. My questions are:

  1. What is the best approach for my problem (given the number of rows/columns I am working with)?
  2. Is it feasible to vectorise over a dataframe with a large number of columns?

1 Answer

The simplest way to improve performance is to build up the values in Python data structures and then append the new columns to the dataframe once, rather than writing individual values into the dataframe on every iteration.

data_structure = []
mapping = {
    'temp_1': 5,
    'temp_2': 4,
    'temp_3': 3,
    'temp_4': 2,
    'temp_5': 1,
    'temp_6': 0
}

for row in df.itertuples():
    d = {}
    for name, column_offset in mapping.items():
        idx = row.month_index - column_offset 
        column_name = df.columns[idx]
        d[name] = getattr(row, column_name)
    data_structure.append(d)

newdf = pd.DataFrame(data_structure)
combined = pd.concat([df, newdf], axis=1)

If the above solution isn't fast enough, you should be able to use multiprocessing to split the work across multiple cores. You can try modin to do this transparently, or you may have to do it yourself.

As for vectorization: temp_1 sits in positional column month_index - 5. Take the simplified example below and you should be able to use integer-array indexing to get an array of row values by column index.

from io import StringIO
import numpy as np

# first five rows of dataframe for example
data = StringIO("""
9,1193,942,948,1053,922,985,949,987,970
10,884,1084,912,938,1179,958,842,944,1186
8,961,1070,1081,944,1074,988,1023,979,942
8,1040,1009,896,953,947,1058,1133,921,1113
12,913,1114,985,898,1011,1152,953,842,1150
""")

>>> a = np.genfromtxt(data, delimiter=',')
... temp_1_idx = (a[:, 0] - 5).astype(int)
... a[np.arange(len(a)), temp_1_idx]
array([1053., 1179., 1081.,  896.,  953.])
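Extending that idea to the whole dataframe: convert it to a single NumPy array once, then use integer-array ("fancy") indexing to build all six temp columns without any Python-level loop over rows. A sketch on a small stand-in frame (the `temp_*` names match the question; the data here is made up):

```python
import numpy as np
import pandas as pd

# small stand-in for the real dataframe: month_index in column 0,
# month_1 .. month_9 after it (same positional layout as the question)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(800, 1200, size=(5, 9)),
                  columns=[f"month_{m}" for m in range(1, 10)])
df.insert(0, "month_index", [9, 8, 7, 8, 9])

a = df.to_numpy()        # one NumPy array, no per-row pandas calls
rows = np.arange(len(df))
j = a[:, 0].astype(int)  # positional column of month_<month_index>

# temp_1 .. temp_6 come from columns j-5 .. j; one indexing call each
for k, offset in enumerate(range(5, -1, -1), start=1):
    df[f"temp_{k}"] = a[rows, j - offset]
```

Each `a[rows, j - offset]` picks one element per row (row i, column j[i] - offset), which is exactly what the original while loop did one cell at a time.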
Eric Truett