
I'm working on a simulation problem where some data needs to be spun up really fast. Here's the problem: say we have two datasets X and Y, and we need a new dataset Z where every row of X is combined with every row of Y (each X row's values become new columns alongside Y's columns), i.e. the final dataset Z has size len(X)*len(Y). What would be the most efficient way to create dataset Z? Here's what I have so far:

import pandas as pd

X = pd.DataFrame({'simulationid':[0,1,2,3],'x1':[10,20,30,40],'x2':[1,2,3,4]})
Y = pd.DataFrame({'timeofday':[5,10,15,20],'var1':[23.5,45.0,46.4,60.3]})

The simplest option (probably the least efficient) would be to iterate over each row and populate a new dataframe:

# loop over every configuration and generate dataset for each,
# then merge at the end
Z = pd.DataFrame()
cols2merge = ['x1','x2']
for index, row in X.iterrows():
    tmp = Y.copy()  # copy so Y itself isn't mutated each iteration
    for c in cols2merge:
        tmp[c] = row[c]
    Z = pd.concat([Z, tmp])

But this ends up taking a REALLY long time as the size of X grows (anywhere from 1,000 to 10,000 rows). What would be a smarter way to do this, taking advantage of vectorized operations or other pandas-specific optimizations? I'm thinking there's an elegant one-line merge/concat/join solution, but I can't seem to figure it out.

I also tried itertuples instead of iterrows as recommended here: https://github.com/pandas-dev/pandas/issues/10334 but didn't notice a significant improvement in execution time.

Thanks in advance!

Madoo
    [Numpy cartesian product](https://stackoverflow.com/questions/11144513/numpy-cartesian-product-of-x-and-y-array-points-into-single-array-of-2d-points) has a rather impressive discussion on how you might fully optimize this. – ALollz Sep 27 '18 at 20:17
  • Never hit me that this was a cartesian product problem! Thank you for pointing that out. – Madoo Sep 28 '18 at 18:40
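For reference, the NumPy-style cartesian product mentioned in the comment above can also be built without `merge`, by repeating X's rows and tiling Y. This is only a sketch of that idea (using `Index.repeat` and `pd.concat`), not code from either post:

```python
import pandas as pd

X = pd.DataFrame({'simulationid': [0, 1, 2, 3], 'x1': [10, 20, 30, 40], 'x2': [1, 2, 3, 4]})
Y = pd.DataFrame({'timeofday': [5, 10, 15, 20], 'var1': [23.5, 45.0, 46.4, 60.3]})

# Repeat each row of X len(Y) times, tile Y len(X) times,
# then stitch the two halves together column-wise.
Z = pd.concat(
    [X.loc[X.index.repeat(len(Y))].reset_index(drop=True),
     pd.concat([Y] * len(X), ignore_index=True)],
    axis=1,
)
```

The result has len(X)*len(Y) rows, with X's columns repeated blockwise against Y's.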

1 Answer


There are probably faster NumPy-based solutions, but you can just do an enormous merge:

cols2merge = ['x1','x2']
X[cols2merge].assign(dummy=1).merge(Y.assign(dummy=1), on='dummy').drop(columns='dummy')

Output:

    x1  x2  timeofday  var1
0   10   1          5  23.5
1   10   1         10  45.0
2   10   1         15  46.4
3   10   1         20  60.3
4   20   2          5  23.5
5   20   2         10  45.0
6   20   2         15  46.4
7   20   2         20  60.3
8   30   3          5  23.5
9   30   3         10  45.0
10  30   3         15  46.4
11  30   3         20  60.3
12  40   4          5  23.5
13  40   4         10  45.0
14  40   4         15  46.4
15  40   4         20  60.3
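As a side note not in the original answer: on pandas 1.2 or newer, `merge` supports `how='cross'`, which computes the same cartesian product directly and makes the dummy-column trick unnecessary:

```python
import pandas as pd

X = pd.DataFrame({'simulationid': [0, 1, 2, 3], 'x1': [10, 20, 30, 40], 'x2': [1, 2, 3, 4]})
Y = pd.DataFrame({'timeofday': [5, 10, 15, 20], 'var1': [23.5, 45.0, 46.4, 60.3]})

cols2merge = ['x1', 'x2']
# how='cross' joins every row of the left frame with every row of the right.
Z = X[cols2merge].merge(Y, how='cross')
```

This produces the same 16-row output shown above.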
ALollz
    That's a clever one-liner. I benchmarked this solution and it's a lot faster. Thank you! – Madoo Sep 27 '18 at 20:14