1

Given these two dataframes:

df1 =
     Name  Start  End
  0  A     10     20
  1  B     20     30
  2  C     30     40

df2 =
     0   1
  0  5   10
  1  15  20
  2  25  30

df2 has no column names, but you can assume column 0 is an offset of df1.Start and column 1 is an offset of df1.End. I would like to transpose df2 onto df1 to get the Start and End differences. The final df1 dataframe should look like this:

  Name  Start  End  Start_Diff_0  End_Diff_0  Start_Diff_1  End_Diff_1  Start_Diff_2  End_Diff_2
0    A     10   20             5          10            -5           0           -15         -10
1    B     20   30            15          20             5          10            -5           0
2    C     30   40            25          30            15          20             5          10

I have a solution that works, but I'm not satisfied with it because it takes too long to run when processing a dataframe that has millions of rows. Below is a sample test case to simulate processing 30,000 rows. As you can imagine, running the original solution (method_1) on a 1GB dataframe is going to be a problem. Is there a faster way to do this using Pandas, Numpy, or maybe another package?

UPDATE: I've added the provided solutions to the benchmarks.

# Import required modules
import numpy as np
import pandas as pd
import timeit

# Original
def method_1():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Store data for new columns in a dictionary
    new_columns = {}
    for index1, row1 in df1.iterrows():
        for index2, row2 in df2.iterrows():
            key_start = 'Start_Diff_' + str(index2)
            key_end = 'End_Diff_' + str(index2)
            if (key_start in new_columns):
                new_columns[key_start].append(row1[1]-row2[0])
            else:
                new_columns[key_start] = [row1[1]-row2[0]]
            if (key_end in new_columns):
                new_columns[key_end].append(row1[2]-row2[1])
            else:
                new_columns[key_end] = [row1[2]-row2[1]]
    # Add dictionary data as new columns
    for key, value in new_columns.items():
        df1[key] = value

# jezrael - https://stackoverflow.com/a/60843750/452587
def method_2():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Convert selected columns to 2d numpy array
    a = df1[['Start', 'End']].to_numpy()
    b = df2[[0, 1]].to_numpy()
    # Output is 3d array; convert it to 2d array
    c = (a - b[:, None]).swapaxes(0, 1).reshape(a.shape[0], -1)
    # Generate columns names and with DataFrame.join; add to original
    cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
    df1 = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))

# sammywemmy - https://stackoverflow.com/a/60844078/452587
def method_3():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Create numpy arrays of df1 and df2
    df1_start = df1.loc[:, 'Start'].to_numpy()
    df1_end = df1.loc[:, 'End'].to_numpy()
    df2_start = df2[0].to_numpy()
    df2_end = df2[1].to_numpy()
    # Use np tile to create shapes that allow elementwise subtraction
    tiled_start = np.tile(df1_start, (len(df2), 1)).T
    tiled_end = np.tile(df1_end, (len(df2), 1)).T
    # Subtract df2 from df1
    start = np.subtract(tiled_start, df2_start)
    end = np.subtract(tiled_end, df2_end)
    # Create columns for start and end
    start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
    end_columns = [f'End_Diff_{num}' for num in range(len(df2))]
    # Create dataframes of start and end
    start_df = pd.DataFrame(start, columns=start_columns)
    end_df = pd.DataFrame(end, columns=end_columns)
    # Lump start and end into one dataframe
    lump = pd.concat([start_df, end_df], axis=1)
    # Sort the columns by the digits at the end
    filtered = lump.columns[lump.columns.str.contains('\d')]
    cols = sorted(filtered, key=lambda x: x[-1])
    lump = lump.reindex(cols, axis='columns')
    # Hook lump back to df1
    df1 = pd.concat([df1,lump],axis=1)

print('Method 1:', timeit.timeit(method_1, number=3))
print('Method 2:', timeit.timeit(method_2, number=3))
print('Method 3:', timeit.timeit(method_3, number=3))

Output:

Method 1: 50.506279182
Method 2: 0.08886280600000163
Method 3: 0.10297686199999845
thdoan
  • 18,421
  • 1
  • 62
  • 57

3 Answers3

2

I suggest use here numpy - convert selected columns to 2d numpy array in first step::

a = df1[['Start','End']].to_numpy()
b = df2[[0,1]].to_numpy()

Output is 3d array, convert it to 2d array:

c = (a - b[:, None]).swapaxes(0,1).reshape(a.shape[0],-1)
print (c)
[[  5  10  -5   0 -15 -10]
 [ 15  20   5  10  -5   0]
 [ 25  30  15  20   5  10]]

Last generate columns names and with DataFrame.join add to original:

cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
df = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))
print (df)
  Name  Start  End  Start_Diff_0  End_Diff_0  Start_Diff_1  End_Diff_1  \
0    A     10   20             5          10            -5           0   
1    B     20   30            15          20             5          10   
2    C     30   40            25          30            15          20   

   Start_Diff_2  End_Diff_2  
0           -15         -10  
1            -5           0  
2             5          10  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

Don't use iterrows(). If you're simply subtracting values, use vectorization with Numpy (Pandas also offers vectorization, but Numpy is faster).

For instance:

df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)

col_names = "Start_Diff_1 End_Diff_1".split()
df3 = pd.DataFrame(df2.to_numpy() - 10, columns=colnames)

Here df3 equals:

    Start_Diff_1    End_Diff_1
0           -5              0
1           5               10
2           15              20

You can also change column names by doing:

df2.columns = "Start_Diff_0 End_Diff_0".split()

You can use f-strings to change column names in a loop, i.e., f"Start_Diff_{i}", where i is a number in a loop

You can also combine multiple dataframes with:

df = pd.concat([df1, df2],axis=1)
bug_spray
  • 1,445
  • 1
  • 9
  • 23
  • Andre, thanks for the solution. As I'm new to Numpy, it will take me some time to understand what your code is actually doing. Once I understand it more I will pick a solution that works best with the real dataset given to me. – thdoan Mar 27 '20 at 02:06
  • Let me know if there's anything you're confused about. Vectorization just means doing stuff like `A+B` or `A-B` for arrays A and B, instead of subtracting or adding elements individually. You can't do that with python lists, but you can with numpy arrays and pandas dataframes. Note that not all operations / functions can be vectorized, but if you're doing something math heavy, chances are it can be, and it tends to be significantly faster than iteration and python for loops. – bug_spray Mar 27 '20 at 04:04
  • 1
    Check out cs95's answer to this stackoverflow question: https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758 – bug_spray Mar 27 '20 at 04:09
  • bug_spray, I just read the excellent answer provided by cs95. Vectorization reminds me of the matrix arithmetic we did in high school :). jezrael is using this, right -- my benchmarks show that their solution was magnitudes faster than mine (I will update my description with benchmarks soon). – thdoan Mar 28 '20 at 19:47
0

This is one way to go about it:

 #create numpy arrays of df1 and 2

df1_start = df1.loc[:,'Start'].to_numpy()
df1_end = df1.loc[:,'End'].to_numpy()

df2_start = df2[0].to_numpy()
df2_end = df2[1].to_numpy()

#use np tile to create shapes
#that allow element wise subtraction
tiled_start = np.tile(df1_start,(len(df2),1)).T
tiled_end = np.tile(df1_end,(len(df2),1)).T

#subtract df2 from df1
start = np.subtract(tiled_start,df2_start)
end = np.subtract(tiled_end, df2_end)

#create columns for start and end
start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
end_columns = [f'End_Diff_{num}' for num in range(len(df2))]

#create dataframes of start and end
start_df = pd.DataFrame(start,columns=start_columns)
end_df = pd.DataFrame(end, columns = end_columns)

#lump start and end into one dataframe
lump = pd.concat([start_df,end_df],axis=1)

#sort the columns by the digits at the end
filtered = final.columns[final.columns.str.contains('\d')]

cols = sorted(filtered, key = lambda x: x[-1])

lump = lump.reindex(cols,axis='columns')

#hook lump back to df1
final = pd.concat([df1,lump],axis=1)
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • Thanks for the answer. I kept getting "local variable 'final' referenced before assignment" error; I fixed it by changing `final` to `lump` when defining `filtered` and since I'm appending columns to df1, I changed `final` to `df1` in last line. – thdoan Mar 28 '20 at 19:37