1

I'm relatively new to Python and definitely new to pandas.

I have two DataFrames, each with different indexes (and a very different amount of rows) but with the same amount/names of columns.

The first df is supposed to be massive (significantly bigger than the existing one which is currently 200 rows, we're talking tens of thousands or more) and each row needs to compare itself with a tolerance to the other dataframe and create a new df where each row provides the matching indexes from the second df.

Basically: For every row in the first df ('df1), every column ('x') needs to be compared to the same column of every row of the second df ('df2','y') - (with the rules: x=y with tolerance = 2.5 or x>y). Only if all the columns for a row in df2 return True, will it pass as matching for the df1 row and its index will appear in a new results df.

the first df looks like this:

    PLOT SEED  SUB EDGE   PT         1         2         3         4         5  ...        87        88        89        90        91        92        93        94        95        96
0    117    1    1    1    1  0.241287  0.242358  0.244491  0.247731  0.252153  ...  0.296214  0.283589  0.273119  0.264486  0.257451  0.251830  0.247488  0.244320  0.242257  0.241254
3    117    1    1    2    1  0.261726  0.261483  0.262364  0.264387  0.267595  ...  0.342214  0.324959  0.310626  0.298730  0.288903  0.280866  0.274407  0.269366  0.265624  0.263096
6    117    1    1    3    1  0.247061  0.247782  0.249579  0.252491  0.256584  ...  0.308629  0.294821  0.283358  0.273881  0.266121  0.259870  0.254973  0.251313  0.248807  0.247400
9    117    1    2    1    1  0.241693  0.242740  0.244850  0.248068  0.252467  ...  0.297073  0.284368  0.273830  0.265140  0.258055  0.252392  0.248011  0.244810  0.242717  0.241685
12   117   10    1    1    1  0.241693  0.242740  0.244850  0.248068  0.252467  ...  0.297073  0.284368  0.273830  0.265140  0.258055  0.252392  0.248011  0.244810  0.242717  0.241685
...  ...  ...  ...  ...  ...       ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
567  567   40    3    2    1  0.264364  0.263938  0.264644  0.266498  0.269540  ...  0.348593  0.330642  0.315735  0.303358  0.293122  0.284734  0.277970  0.272660  0.268678  0.265933
570  567   41    1    1    1  0.259267  0.259194  0.260234  0.262412  0.265773  ...  0.336367  0.319737  0.305923  0.294462  0.285005  0.277287  0.271106  0.266310  0.262787  0.260456
573  567   41    2    1    1  0.306493  0.302668  0.300207  0.299056  0.299190  ...  0.467365  0.434172  0.407009  0.384594  0.366004  0.350561  0.337756  0.327205  0.318617  0.311769
576  567   41    2    2    1  0.188763  0.192255  0.196737  0.202319  0.209149  ...  0.200286  0.195093  0.190960  0.187785  0.185496  0.184040  0.183386  0.183520  0.184444  0.186179
579  567   41    3    1    1  0.263622  0.263247  0.264003  0.265905  0.268994  ...  0.346787  0.329034  0.314291  0.302051  0.291931  0.283642  0.276965  0.271731  0.267818  0.265134
194 rows × 101 columns

and the second looks like this:

             1         2         3         4         5         6         7         8         9        10  ...        87        88        89        90        91        92        93        94        95        96
TYPE                                                                                                                                                                                                             
1_1   0.432803  0.413895  0.398204  0.385249  0.374664  0.366171  0.359560  0.354675  0.351406  0.349681  ...  1.242901  1.017809  0.864880  0.754714  0.671995  0.607963  0.557251  0.516389  0.483038  0.455565
2_1   0.251657  0.252088  0.253608  0.256250  0.260073  0.265165  0.271646  0.279680  0.289481  0.301331  ...  0.318829  0.304012  0.291705  0.281515  0.273142  0.266361  0.260998  0.256926  0.254051  0.252307
3_1   0.250000  0.250536  0.252157  0.254898  0.258819  0.264011  0.270598  0.278746  0.288675  0.300672  ...  0.315118  0.300672  0.288675  0.278746  0.270598  0.264011  0.258819  0.254898  0.252157  0.250536
4_1   0.250000  0.250536  0.252157  0.254898  0.258819  0.264011  0.270598  0.278746  0.288675  0.300672  ...  0.315118  0.300672  0.288675  0.278746  0.270598  0.264011  0.258819  0.254898  0.252157  0.250536
5_1   0.250000  0.250536  0.252157  0.254898  0.258819  0.264011  0.270598  0.278746  0.288675  0.300672  ...  0.315118  0.300672  0.288675  0.278746  0.270598  0.264011  0.258819  0.254898  0.252157  0.250536
6_1   0.350000  0.341790  0.335360  0.330556  0.327270  0.325428  0.324991  0.325949  0.328322  0.332164  ...  0.636539  0.574485  0.525505  0.486140  0.454070  0.427688  0.405845  0.387705  0.372644  0.360194
7_1   0.350000  0.341790  0.335360  0.330556  0.327270  0.325428  0.324991  0.325949  0.328322  0.332164  ...  0.636539  0.574485  0.525505  0.486140  0.454070  0.427688  0.405845  0.387705  0.372644  0.360194
8_1   0.250000  0.250536  0.252157  0.254898  0.258819  0.264011  0.270598  0.278746  0.288675  0.300672  ...  0.315118  0.300672  0.288675  0.278746  0.270598  0.264011  0.258819  0.254898  0.252157  0.250536
8 rows × 96 columns

The result df should look similar to this:

    PLOT  SEED  SUB  EDGE  PT  MATCHING_TYPO
0    117     1    1     1   1  [3_1,6_1,8_1]
3    117     1    1     2   1  [1_1,7_1,8_1]
6    117     1    1     3   1      [3_1,5_1]
9    117     1    2     1   1          [7_1]
12   117    10    1     1   1      [7_1,8_1]
...  ...   ...  ...   ... ...            ...    

It is my understanding that running a for loop in a huge dataframe like this is not recommended, but I can't find a better suited solution, and evem if it's a loop I'm not sure how to do this comparison with tolerances /or larger than against another df.

I found df.compare to give me this error: UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U2')) -> None

I'm doing this:

parcel_a = parcel_testing_a[range(1,97)].astype('float')
parcel_a.dtypes

And getting this:

1     float64
2     float64
3     float64
4     float64
5     float64
       ...   
92    float64
93    float64
94    float64
95    float64
96    float64
Length: 96, dtype: object

To be honest I'm pretty sure compare isn't the right method but it's what I'm familiar with so far.

Thank you in advance! Lev

EDIT: I tried converting the dataframes to numpy arrays, then looping over each individual row against each individual row in the second df, I'm getting back a list of booleans that I need to turn into the 2nd pd's indexes and then I can turn into another df.

def compare_dfs(df1,df2):
results_all=[]
df1_np = df1.to_numpy()
df2_np = df2.to_numpy()
for row in df1_np:
    results_row=[]
    for row_t in df2_np:
        x = row - row_t
        y = x > abs_tol
        z = y.min()
        results_row.append(y)
    results_all.append(results_row)
return results_all

I'm just not sure it's the most efficient way, it seems like it has too many steps, especially considering now I have to do this process 3 times which will be 600 rows (df1) * 8 rows (df2) but in the future it will be hunderds of thousands of rows against dozens of rows.

I'd love to be corrected/get some advice. Thank you!

wjandrea
  • 28,235
  • 9
  • 60
  • 81
  • 1
    please describe what you mean by this *"row needs to compare itself with a tolerance to the other dataframe"* mathematically.... – D.L Jun 14 '23 at 17:12
  • Well, I have been doing something similar with 4 million dataset which took a week to compute and i started using numpy which made it in some hours, your problem can be solved by my solution but please make it clear, your rules and what you are trying to get and some codes too.\ – Ajay Pun Magar Jun 14 '23 at 17:24
  • Sorry for being unclear, I edited the post. Basically: For every row in the first df ('df1), every column ('x') needs to be compared to the same column of every row of the second df ('df2','y') - (with the rules: x=y with tolerance = 2.5 or x>y). Only if all the columns for a row in df2 return True, will it pass as matching for the df1 row and its index will appear in a new results df. – Lev Zhitnik Jun 14 '23 at 17:59
  • 1
    @LevZhitnik It'd help to make a [mre] since this is a lot of data. See [How to make good reproducible pandas examples](/q/20109391/4518341). Surely 3 compared columns would be fine for the sake of example, and only one other column in `df1`. – wjandrea Jun 16 '23 at 16:02
  • You may be looking for [.merge()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) or [.query()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html). But honestly, if you have a ton of rows you're probably better off writing them to an [RDBMS](https://docs.python.org/3/library/sqlite3.html) table, defining an index, and issuing a [JOIN](https://en.wikipedia.org/wiki/Join_(SQL)) query that has a WHERE clause which specifies your match tolerances. – J_H Jun 16 '23 at 16:40

0 Answers0