I'm working on a simulation problem where some data needs to be spun up really fast. Here's the problem: say we have two datasets X and Y, and we need a new dataset Z where each row of X is used to populate new columns in Y; this is repeated for every row of X and the results are merged together, so the final dataset Z has len(X)*len(Y) rows. What would be the most efficient way to create dataset Z? Here's what I have so far:
import pandas as pd

X = pd.DataFrame({'simulationid': [0, 1, 2, 3], 'x1': [10, 20, 30, 40], 'x2': [1, 2, 3, 4]})
Y = pd.DataFrame({'timeofday': [5, 10, 15, 20], 'var1': [23.5, 45.0, 46.4, 60.3]})
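To make the target concrete, Z should come out looking something like this (x1/x2 from each row of X broadcast across all four rows of Y, 16 rows total):

   timeofday  var1  x1  x2
0          5  23.5  10   1
1         10  45.0  10   1
2         15  46.4  10   1
3         20  60.3  10   1
4          5  23.5  20   2
..       ...   ...  ..  ..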
The simplest option (probably the least efficient) would be to iterate over each row and populate a new dataframe:
# loop over every configuration and generate a dataset for each,
# then merge at the end
Z = pd.DataFrame()
cols2merge = ['x1', 'x2']
for index, row in X.iterrows():
    # work on a copy so Y itself isn't mutated across iterations
    tmp = Y.copy()
    for c in cols2merge:
        tmp[c] = row[c]
    Z = pd.concat([Z, tmp], ignore_index=True)
But this ends up taking a REALLY long time as X grows (anywhere from 1,000 to 10,000 rows). What would be a smarter way to do this, taking advantage of vectorized operations or other pandas-specific optimizations? I'm thinking there's an elegant one-line merge/concat/join solution but I can't seem to figure it out.
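The closest I've come to that one-liner is a cross join via a temporary constant key column (a sketch; _key is just a throwaway name I made up, and I believe pandas >= 1.2 also accepts merge(..., how='cross') directly, though I haven't benchmarked either at this scale):

cols2merge = ['x1', 'x2']
# every row of X matches every row of Y on the constant key,
# giving the full Cartesian product
Z = (
    X[cols2merge].assign(_key=1)
    .merge(Y.assign(_key=1), on='_key')
    .drop(columns='_key')
)

This seems to produce the same len(X)*len(Y) rows as the loop (just with the x columns first), but I don't know whether it's the intended idiom or the fastest option.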
I also tried itertuples instead of iterrows, as recommended here: https://github.com/pandas-dev/pandas/issues/10334, but didn't notice a significant improvement in execution time.
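For reference, the itertuples variant was essentially this (a sketch of what I ran):

Z = pd.DataFrame()
for row in X.itertuples(index=False):
    tmp = Y.copy()      # avoid mutating Y itself
    tmp['x1'] = row.x1  # attribute access instead of row indexing
    tmp['x2'] = row.x2
    Z = pd.concat([Z, tmp], ignore_index=True)

My guess is that the repeated pd.concat inside the loop dominates the runtime either way, since it copies the whole of Z on every iteration.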
Thanks in advance!