I'm relatively new to Python and definitely new to pandas.
I have two DataFrames, each with different indexes (and a very different amount of rows) but with the same amount/names of columns.
The first df is supposed to be massive (significantly bigger than the existing one which is currently 200 rows, we're talking tens of thousands or more) and each row needs to compare itself with a tolerance to the other dataframe and create a new df where each row provides the matching indexes from the second df.
Basically: For every row in the first df ('df1), every column ('x') needs to be compared to the same column of every row of the second df ('df2','y') - (with the rules: x=y with tolerance = 2.5 or x>y
). Only if all the columns for a row in df2 return True, will it pass as matching for the df1 row and its index will appear in a new results df.
the first df looks like this:
PLOT SEED SUB EDGE PT 1 2 3 4 5 ... 87 88 89 90 91 92 93 94 95 96
0 117 1 1 1 1 0.241287 0.242358 0.244491 0.247731 0.252153 ... 0.296214 0.283589 0.273119 0.264486 0.257451 0.251830 0.247488 0.244320 0.242257 0.241254
3 117 1 1 2 1 0.261726 0.261483 0.262364 0.264387 0.267595 ... 0.342214 0.324959 0.310626 0.298730 0.288903 0.280866 0.274407 0.269366 0.265624 0.263096
6 117 1 1 3 1 0.247061 0.247782 0.249579 0.252491 0.256584 ... 0.308629 0.294821 0.283358 0.273881 0.266121 0.259870 0.254973 0.251313 0.248807 0.247400
9 117 1 2 1 1 0.241693 0.242740 0.244850 0.248068 0.252467 ... 0.297073 0.284368 0.273830 0.265140 0.258055 0.252392 0.248011 0.244810 0.242717 0.241685
12 117 10 1 1 1 0.241693 0.242740 0.244850 0.248068 0.252467 ... 0.297073 0.284368 0.273830 0.265140 0.258055 0.252392 0.248011 0.244810 0.242717 0.241685
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
567 567 40 3 2 1 0.264364 0.263938 0.264644 0.266498 0.269540 ... 0.348593 0.330642 0.315735 0.303358 0.293122 0.284734 0.277970 0.272660 0.268678 0.265933
570 567 41 1 1 1 0.259267 0.259194 0.260234 0.262412 0.265773 ... 0.336367 0.319737 0.305923 0.294462 0.285005 0.277287 0.271106 0.266310 0.262787 0.260456
573 567 41 2 1 1 0.306493 0.302668 0.300207 0.299056 0.299190 ... 0.467365 0.434172 0.407009 0.384594 0.366004 0.350561 0.337756 0.327205 0.318617 0.311769
576 567 41 2 2 1 0.188763 0.192255 0.196737 0.202319 0.209149 ... 0.200286 0.195093 0.190960 0.187785 0.185496 0.184040 0.183386 0.183520 0.184444 0.186179
579 567 41 3 1 1 0.263622 0.263247 0.264003 0.265905 0.268994 ... 0.346787 0.329034 0.314291 0.302051 0.291931 0.283642 0.276965 0.271731 0.267818 0.265134
194 rows × 101 columns
and the second looks like this:
1 2 3 4 5 6 7 8 9 10 ... 87 88 89 90 91 92 93 94 95 96
TYPE
1_1 0.432803 0.413895 0.398204 0.385249 0.374664 0.366171 0.359560 0.354675 0.351406 0.349681 ... 1.242901 1.017809 0.864880 0.754714 0.671995 0.607963 0.557251 0.516389 0.483038 0.455565
2_1 0.251657 0.252088 0.253608 0.256250 0.260073 0.265165 0.271646 0.279680 0.289481 0.301331 ... 0.318829 0.304012 0.291705 0.281515 0.273142 0.266361 0.260998 0.256926 0.254051 0.252307
3_1 0.250000 0.250536 0.252157 0.254898 0.258819 0.264011 0.270598 0.278746 0.288675 0.300672 ... 0.315118 0.300672 0.288675 0.278746 0.270598 0.264011 0.258819 0.254898 0.252157 0.250536
4_1 0.250000 0.250536 0.252157 0.254898 0.258819 0.264011 0.270598 0.278746 0.288675 0.300672 ... 0.315118 0.300672 0.288675 0.278746 0.270598 0.264011 0.258819 0.254898 0.252157 0.250536
5_1 0.250000 0.250536 0.252157 0.254898 0.258819 0.264011 0.270598 0.278746 0.288675 0.300672 ... 0.315118 0.300672 0.288675 0.278746 0.270598 0.264011 0.258819 0.254898 0.252157 0.250536
6_1 0.350000 0.341790 0.335360 0.330556 0.327270 0.325428 0.324991 0.325949 0.328322 0.332164 ... 0.636539 0.574485 0.525505 0.486140 0.454070 0.427688 0.405845 0.387705 0.372644 0.360194
7_1 0.350000 0.341790 0.335360 0.330556 0.327270 0.325428 0.324991 0.325949 0.328322 0.332164 ... 0.636539 0.574485 0.525505 0.486140 0.454070 0.427688 0.405845 0.387705 0.372644 0.360194
8_1 0.250000 0.250536 0.252157 0.254898 0.258819 0.264011 0.270598 0.278746 0.288675 0.300672 ... 0.315118 0.300672 0.288675 0.278746 0.270598 0.264011 0.258819 0.254898 0.252157 0.250536
8 rows × 96 columns
The result df should look similar to this:
PLOT SEED SUB EDGE PT MATCHING_TYPO
0 117 1 1 1 1 [3_1,6_1,8_1]
3 117 1 1 2 1 [1_1,7_1,8_1]
6 117 1 1 3 1 [3_1,5_1]
9 117 1 2 1 1 [7_1]
12 117 10 1 1 1 [7_1,8_1]
... ... ... ... ... ... ...
It is my understanding that running a for loop in a huge dataframe like this is not recommended, but I can't find a better suited solution, and evem if it's a loop I'm not sure how to do this comparison with tolerances /or larger than against another df.
I found df.compare to give me this error: UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U2')) -> None
I'm doing this:
parcel_a = parcel_testing_a[range(1,97)].astype('float')
parcel_a.dtypes
And getting this:
1 float64
2 float64
3 float64
4 float64
5 float64
...
92 float64
93 float64
94 float64
95 float64
96 float64
Length: 96, dtype: object
To be honest I'm pretty sure compare isn't the right method but it's what I'm familiar with so far.
Thank you in advance! Lev
EDIT: I tried converting the dataframes to numpy arrays, then looping over each individual row against each individual row in the second df, I'm getting back a list of booleans that I need to turn into the 2nd pd's indexes and then I can turn into another df.
def compare_dfs(df1,df2):
results_all=[]
df1_np = df1.to_numpy()
df2_np = df2.to_numpy()
for row in df1_np:
results_row=[]
for row_t in df2_np:
x = row - row_t
y = x > abs_tol
z = y.min()
results_row.append(y)
results_all.append(results_row)
return results_all
I'm just not sure it's the most efficient way, it seems like it has too many steps, especially considering now I have to do this process 3 times which will be 600 rows (df1) * 8 rows (df2) but in the future it will be hunderds of thousands of rows against dozens of rows.
I'd love to be corrected/get some advice. Thank you!