7

I have 2 dataframes, df1 and df2, and want to do the following, storing results in df3:

for each row in df1:

    for each row in df2:

        create a new row in df3 (called "df1-1, df2-1" or whatever) to store results 

        for each cell(column) in df1: 

            for the cell in df2 whose column name is the same as for the cell in df1:

                compare the cells (using some comparing function func(a,b) ) and, 
                depending on the result of the comparison, write result into the 
                appropriate column of the "df1-1, df2-1" row of df3)

For example, something like:

df1
A   B    C      D
foo bar  foobar 7
gee whiz herp   10

df2
A   B   C      D
zoo car foobar 8

df3
df1-df2 A             B              C                   D
foo-zoo func(foo,zoo) func(bar,car)  func(foobar,foobar) func(7,8)
gee-zoo func(gee,zoo) func(whiz,car) func(herp,foobar)   func(10,8)

I've started with this:

for r1 in df1.iterrows():
    for r2 in df2.iterrows():
        for c1 in r1:
            for c2 in r2:

but am not sure what to do with it, and would appreciate some help.

ntalbs
  • 28,700
  • 8
  • 66
  • 83
Zubo
  • 1,543
  • 2
  • 20
  • 26
  • Because you are applying a func to columns of the same name, you could iterate solely through the columns and use vectorization, e.g. df3['A'] = func(df1['A'], df2['A']), etc? – StarFox Sep 12 '16 at 20:48
  • @StarFox interesting, so could I maybe do something like: for column in df3: df3[column] = func(df1[column], df2[column])? – Zubo Sep 12 '16 at 20:52
  • Sure! That's the power of pandas/numpy (vectorization, in general). I'll provide some examples below, and we'll go from there – StarFox Sep 12 '16 at 20:56
  • 1
    I think you can base a solution on a cartesian product between your 2 dataframes, have a look here for a starting point: http://stackoverflow.com/questions/13269890/cartesian-product-in-pandas#answer-13270110 – Svend Sep 12 '16 at 21:05
  • @Svend this seems like a promising idea, thanks! I will try StarFox' solution first, though. – Zubo Sep 12 '16 at 22:07

1 Answers1

7

So to continue the discussion in the comments, you can use vectorization, which is one of the selling points of a library like pandas or numpy. Ideally, you shouldn't ever be calling iterrows(). To be a little more explicit with my suggestion:

# with df1 and df2 provided as above, an example
df3 = df1['A'] * 3 + df2['A']

# recall that df2 only has the one row so pandas will broadcast a NaN there
df3
0    foofoofoozoo
1             NaN
Name: A, dtype: object

# more generally

# we know that df1 and df2 share column names, so we can initialize df3 with those names
df3 = pd.DataFrame(columns=df1.columns) 
for colName in df1:
    df3[colName] = func(df1[colName], df2[colName]) 

Now, you could even have different functions applied to different columns by, say, creating lambda functions and then zipping them with the column names:

# some example functions
colAFunc = lambda x, y: x + y
colBFunc = lambda x, y; x - y
....
columnFunctions = [colAFunc, colBFunc, ...]

# initialize df3 as above
df3 = pd.DataFrame(columns=df1.columns)
for func, colName in zip(columnFunctions, df1.columns):
    df3[colName] = func(df1[colName], df2[colName])

The only "gotcha" that comes to mind is that you need to be sure that your function is applicable to the data in your columns. For instance, if you were to do something like df1['A'] - df2['A'] (with df1, df2 as you have provided), that would raise a ValueError as the subtraction of two strings is undefined. Just something to be aware of.


Edit, re: your comment: That is doable as well. Iterate over the dfX.columns that is larger, so you don't run into a KeyError, and throw an if statement in there:

# all the other jazz
# let's say df1 is [['A', 'B', 'C']] and df2 is [['A', 'B', 'C', 'D']]
# so iterate over df2 columns
for colName in df2:
    if colName not in df1:
        df3[colName] = np.nan # be sure to import numpy as np
    else:
        df3[colName] = func(df1[colName], df2[colName])  
StarFox
  • 529
  • 3
  • 10
  • Yes, this is very helpful, and I have accepted it as the answer, thanks a lot for taking the time! Can this be modified to be used if the number of columns is not equal? I.e, there might be columns in df1 which do not exist in df2; the comparison function should just output something like N/A. – Zubo Sep 12 '16 at 21:33