I have a large dataframe (2,000,000+ rows by 2,000+ columns) and I want to create some new columns based on the data contained in each row. One column holds a number which I want to use as an indexer to select data from other columns. Because this indexer changes from row to row, I can't select the columns by name and have to use numeric (positional) indexing instead. After identifying the columns I need, I then want to write their values out to a new set of columns (after which I will perform some calculations). For example, if month_index is 9 for a given row, I want temp_1 to temp_6 to hold that row's values for month_4 to month_9.
Below is a simplified version of what I have tried:
# import packages
import pandas as pd
import numpy as np
# create dataframe
n = 10000
a_bar = 9; a_sd = 2
b_bar = 1000; b_sd = 100
np.random.seed(12345)
df = pd.DataFrame(dict(month_index=np.random.normal(a_bar, a_sd, size=n),
                       month_1=np.random.normal(b_bar, b_sd, size=n),
                       month_2=np.random.normal(b_bar, b_sd, size=n),
                       month_3=np.random.normal(b_bar, b_sd, size=n),
                       month_4=np.random.normal(b_bar, b_sd, size=n),
                       month_5=np.random.normal(b_bar, b_sd, size=n),
                       month_6=np.random.normal(b_bar, b_sd, size=n),
                       month_7=np.random.normal(b_bar, b_sd, size=n),
                       month_8=np.random.normal(b_bar, b_sd, size=n),
                       month_9=np.random.normal(b_bar, b_sd, size=n),
                       month_10=np.random.normal(b_bar, b_sd, size=n),
                       month_11=np.random.normal(b_bar, b_sd, size=n),
                       month_12=np.random.normal(b_bar, b_sd, size=n)),
                  columns=['month_index', 'month_1', 'month_2', 'month_3',
                           'month_4', 'month_5', 'month_6', 'month_7',
                           'month_8', 'month_9', 'month_10', 'month_11', 'month_12'])
# round all column values
df = df.round()
# restrict value of month index
conditions = [df.month_index < 7, df.month_index > 12, (df.month_index >= 7) & (df.month_index <= 12)]
values = [7, 12, df.month_index]
df["month_index"] = np.select(conditions, values)
# reduce size of dataframe
for column in df.columns:
    df[column] = pd.to_numeric(df[column], downcast='integer')
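# optional sanity check: confirm the downcast actually shrinks the frame
# (this matters far more on the real 2,000,000 x 2,000 dataframe than on this sample)
print(round(df.memory_usage(deep=True).sum() / 1e6, 1), "MB")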
# select relevant data using function
def select_columns(df):
    i = 0
    while i < len(df):
        j = df.at[i, 'month_index']
        # temp_1..temp_6 hold this row's values for month_(j-5) .. month_j
        # (month_k sits at column position k, since month_index is column 0)
        df.at[i, "temp_1"] = df.iat[i, j-5]
        df.at[i, "temp_2"] = df.iat[i, j-4]
        df.at[i, "temp_3"] = df.iat[i, j-3]
        df.at[i, "temp_4"] = df.iat[i, j-2]
        df.at[i, "temp_5"] = df.iat[i, j-1]
        df.at[i, "temp_6"] = df.iat[i, j]
        i += 1
    return df
df = select_columns(df)
As I've found out, this is terribly inefficient, and I would much rather use pandas' .apply method or (if possible) vectorisation. Whilst I have tried out NumPy's vectorisation functionality (roughly sketched at the end of this post), I am aware that this approach is only feasible if the columns can be passed to a function as NumPy arrays. My questions are:
- What is the best approach for my problem (given the number of rows/columns I am working with)?
- Is it feasible to vectorise over a dataframe with a large number of columns?
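For reference, the NumPy-based experiment mentioned above looked roughly like the sketch below: it pulls the month columns out as a 2D array, builds the six column positions per row by broadcasting, and gathers them with np.take_along_axis. I am not confident this is correct in every case, nor that it is the idiomatic way to do it:

# sketch of the vectorised attempt (works on the toy data above, but I am
# unsure whether it is the right approach for the full-size dataframe)
month_cols = ['month_{}'.format(k) for k in range(1, 13)]
months = df[month_cols].to_numpy()                # shape (n, 12)
j = df['month_index'].to_numpy()                  # integer values between 7 and 12
# column positions (within `months`) of month_(j-5) .. month_j for each row
idx = (j[:, None] - 1) + np.arange(-5, 1)
temp = np.take_along_axis(months, idx, axis=1)    # shape (n, 6)
for k in range(6):
    df['temp_{}'.format(k + 1)] = temp[:, k]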