Combine two data frames with overlapping ranges and calculate overlap grouped by category

Question

Building on this answer
I have two big dataframes (100K rows), df Assay has values, df Strat has 'Types'. I want to assign a 'Type' from Strat to a column in Assay based on depth. The depths are given as depth 'From' and depth 'To' columns. The 'types' are also defined by depth 'From' and 'To'. BUT they are NOT the same intervals. Assay depths may span multiple Strat types.

I want to assign the Strat 'types' to the Assay df, and if there are multiple types, try and capture that information too.

I want to loop over the data to populate the Type column for each HOLE_ID.

Generate example data:

import pandas as pd
import numpy as np
Assay=pd.DataFrame(np.array([['Hole_1',1.0,2.5,0.001],['Hole_1',2.5,5.0,0.005],['Hole_1',5.0,7.0,0.002],['Hole_1',7.0,10.0,0.001],['Hole_2',1.0,3.0,0.001],['Hole_2',3.0,5.0,0.005],['Hole_2',5.0,7.0,0.002],['Hole_2',7.0,10.0,0.001]]),columns=['HOLE_ID','FROM', 'TO', 'val'])

Strat=pd.DataFrame(np.array([['Hole_1',0.0,4.0,'A'],['Hole_1',4.0,5.0,'B'],['Hole_1',5.0,6.5,'C'],['Hole_1',6.5,8.0,'D'],['Hole_1',8.0,10.0,'E'],['Hole_2',0.0,4.0,'A'],['Hole_2',4.0,5.1,'B'],['Hole_2',5.1,6.0,'C'],['Hole_2',6.0,8.0,'D'],['Hole_2',8.0,10.0,'E']]),columns=['HOLE_ID','FROM', 'TO', 'Type'])

Assay

Out[1]: 
  HOLE_ID FROM    TO    val
0  Hole_1  1.0   2.5  0.001
1  Hole_1  2.5   5.0  0.005
2  Hole_1  5.0   7.0  0.002
3  Hole_1  7.0  10.0  0.001
4  Hole_2  1.0   3.0  0.001
5  Hole_2  3.0   5.0  0.005
6  Hole_2  5.0   7.0  0.002
7  Hole_2  7.0  10.0  0.001

Strat

Out[2]: 
  HOLE_ID FROM    TO Type
0  Hole_1  0.0   4.0    A
1  Hole_1  4.0   5.0    B
2  Hole_1  5.0   6.5    C
3  Hole_1  6.5   8.0    D
4  Hole_1  8.0  10.0    E
5  Hole_2  0.0   4.0    A
6  Hole_2  4.0   5.1    B
7  Hole_2  5.1   6.0    C
8  Hole_2  6.0   8.0    D
9  Hole_2  8.0  10.0    E

Example of desired output:

  HOLE_ID FROM    TO    val                Type
0  Hole_1  1.0   2.5  0.001              A 100%
1  Hole_1  2.5     5  0.005         A 60%,B 44%
2  Hole_1  5.0   7.0  0.002        C 80%, D 20%
3  Hole_1  7.0  10.0  0.001        D 30%, E 70%
4  Hole_2  1.0   3.0  0.001              A 100%
5  Hole_2  3.0   5.0  0.005         A 50%, B50%
6  Hole_2  5.0   7.0  0.002  B 5%, C 45%, D 50%
7  Hole_2  7.0  10.0  0.001         D 30% E 70%

My attempt is below, but doesn't work. I am not very good at loops and I have had a few promising attempts but the code seemed to run forever (note, my actual dataset is ~100k rows and 1500 HOLE_ID's so may be quite demanding on my system).

I have added np.arange so that I can use floats (0.1 m increments to generate the auxiliary series) and I think I have the percentages calculating but I am a bit out of my depth.

A bit of pre-processing to make sure only matching hole IDs are used (real data is big, and also contains additional cols not included in the example data set.)

assay_Hole_IDs =Assay['HOLE_ID'].unique().tolist()
strat_Hole_IDS =Strat['HOLE_ID'].unique().tolist()

Strat=Strat[Strat['HOLE_ID'].isin(assay_Hole_IDs)]
Assay=Assay[Assay['HOLE_ID'].isin(assay_Hole_IDs)]


assay_Hole_IDs =Assay['HOLE_ID'].unique().tolist()
strat_Hole_IDS =Strat['HOLE_ID'].unique().tolist()

check that there are no additional values


j=set(assay_Hole_IDs).symmetric_difference(set(strat_Hole_IDS))
print len(j)
j

then:

all_holes= Strat['HOLE_ID'].unique().tolist()

def getType(row):
for hole in all_holes:
    df=Strat.loc[Strat['HOLE_ID']==hole]

    units = df.set_index('Type').apply(lambda row: pd.Series(
    np.arange(row.FROM, row.TO,0.1)), axis=1).stack()\
    .reset_index(level=1, drop=True)

    gr = units[units.ge(row.FROM) & units.lt(row.TO)].groupby(level=0)
    if gr.ngroups == 1:
        return gr.ngroup().index[0]

    txt = []
    counts = []
    pct=[]
    for key, grp in gr:
        siz = grp.size
        un = 'unit' if siz == 1 else 'units'

        counts.append(float(siz))
    for x in counts:
        p=(float(x)/float(sum(counts))*100)
        pct.append(float(p))
    return pct

then:

assay['Type'] = assay.groupby('HOLE_ID').apply(getType)

Can anyone see why this isn't working?

Can you explain the logic a little more? That will help others answer your question. — Umar.H, Oct 28 '19 at 09:01

splash58 · Accepted Answer · 2019-10-28T13:38:13.400

1

def group(df1):
    df2 = Strat[Strat['HOLE_ID']==df1.iloc[0]['HOLE_ID']]
    df1[['FROM','TO']] = df1[['FROM','TO']].astype(float)
    df2[['FROM','TO']] = df2[['FROM','TO']].astype(float)

    temp =  pd.concat([df1[['FROM','TO']],df2[['FROM','TO']]]).unstack().reset_index(drop=True) \
              .drop_duplicates().sort_values().reset_index(drop=True) \
              .to_frame(name='FROM').merge(df2, how='outer').ffill()
   temp['TO'] = temp.shift(-1)['FROM']


    def tobins(x):
        agg = temp[(x.FROM <= temp.FROM) & (temp.FROM < x.TO)].groupby('Type') \
                .apply(lambda y: y['TO'].max() - y['FROM'].min()).reset_index(name='res')
        agg.res = agg.Type + ' ' + (agg.res/agg.res.sum()).map('{:.0%}'.format)
        return '; '.join(agg.res.tolist())

    df1['Type'] = df1.apply(tobins,axis=1)
    return df1

Assay.groupby('HOLE_ID').apply(group)

  HOLE_ID  FROM    TO    val          Type
0  Hole_1   1.0   2.5  0.001        A 100%
1  Hole_1   2.5   5.0  0.005  A 60%; B 40%
2  Hole_1   5.0   7.0  0.002  C 75%; D 25%
3  Hole_1   7.0  10.0  0.001  D 33%; E 67%
4  Hole_2   1.0   3.0  0.001        A 100%
5  Hole_2   3.0   5.0  0.005        B 100%
6  Hole_2   5.0   7.0  0.002  C 75%; D 25%
7  Hole_2   7.0  10.0  0.001  D 33%; E 67%

The key point is building temp DataFrame with all point FROM and TO from both table. For HOLE_ID = 'Hole_1' it looks as below. Now we can get for each row of Assay (x) those rows of temp table with (x.FROM <= temp.FROM < x.TO), group them by Type, calculate shares and join to result format

   FROM HOLE_ID    TO Type
0   0.0  Hole_1   1.0    A
1   1.0  Hole_1   2.5    A
2   2.5  Hole_1   4.0    A
3   4.0  Hole_1   5.0    B
4   5.0  Hole_1   6.5    C
5   6.5  Hole_1   7.0    D
6   7.0  Hole_1   8.0    D
7   8.0  Hole_1  10.0    E
8  10.0  Hole_1   NaN    E

edited Oct 28 '19 at 13:38

answered Oct 28 '19 at 12:53

splash58

26,043
3
22
34

I am getting" KeyError: 'HOLE_ID' ", I made sure that all the Strat values are not NaNs. I noticed that there are more HOLE_IDs in one df than the other in my data so I have constructed a df that only contains matching HOLE_IDs but still get the same error. I do have additional columns in both dfs but that shouldn't matter, right? What if there are undefined Types or no matches? Python2.7(i know, I know...) – flashliquid Oct 29 '19 at 07:11
I'm not sure about 2.7 :( At what line is error occured? *no matches* should not be a problem – splash58 Oct 29 '19 at 07:33
1 assay.groupby('HOLE_ID').apply(group) 2 df2 = strat[strat['HOLE_ID']==df1.iloc[0]['HOLE_ID']] – flashliquid Oct 29 '19 at 07:41
I get the same behaviour if I add an extra Hole_ID to the Assay df in the example data. e.g. rename one entry Hole_3 – flashliquid Oct 29 '19 at 07:53
If there is no corresponding hole_id in Strat - it a really problem. You can add `if not len(df2): return` after line df2=... – splash58 Oct 29 '19 at 08:09
I am pretty confident that only corresponding Hole_IDs are in my data set (see processing step added to original post)... – flashliquid Oct 29 '19 at 08:55
I have some strat Hole_IDs appear in both dfs but there are no Type values for all depths that appear in the Assay df. Could this be the problem? – flashliquid Oct 30 '19 at 08:06
You can print temp table to see that all Assay ranges are covered. If not, it will be better to see example of temp and corresponding df1 to understand is it a source of the error – splash58 Oct 30 '19 at 08:20
Yes! I realised this and found a single row out of 150K with no 'Type' recorded for it, it should have been dropped (I dropped Nans in a data cleansing step) but must be an empty string. Working perfectly now, thank you! – flashliquid Oct 30 '19 at 09:09
I'm really glad of such final, because without real data it's a problem to debug :) – splash58 Oct 30 '19 at 09:13

Combine two data frames with overlapping ranges and calculate overlap grouped by category

1 Answers1

Linked