+------------+-----+--------+-----+-------------+
| Meth.name  |  Min| Max    |Layer| Global name |
+------------+-----+--------+-----+-------------+
|   DTS      | 2600| 3041.2 | AC1 |  DTS        |
|   GGK      | 1800| 3200.0 | AC1 |  DEN        |
|   DTP      | 700 | 3041.0 | AC2 |  DT         |
|   DS       | 700 | 3041.0 | AC3 |  CALI       |
|   PF1      | 2800| 3012.0 | AC3 |  CALI       |
|   PF2      | 3000| 3041.0 | AC4 |  CALI       |
+------------+-----+--------+-----+-------------+

We have to drop rows with duplicated values in the "Global name" column, but in a specific way: among the duplicates we want to keep the row that gives the biggest intersection with the range formed by the maximum of the "Min" column and the minimum of the "Max" column over the non-duplicated rows. In the example above this range is [2600.0, 3041.0], so we want to keep only the row with 'Meth.name' == 'DS', and the overall result should look like:

+------------+-----+--------+-----+-------------+
| Meth.name  |  Min| Max    |Layer| Global name |
+------------+-----+--------+-----+-------------+
|   DTS      | 2600| 3041.2 | AC1 |  DTS        |
|   GGK      | 1800| 3200.0 | AC1 |  DEN        |
|   DTP      | 700 | 3041.0 | AC2 |  DT         |
|   DS       | 700 | 3041.0 | AC3 |  CALI       |
+------------+-----+--------+-----+-------------+

This problem can, of course, be solved in several iterations (calculate the interval from the non-duplicated rows, then iteratively keep only those duplicated rows that give the biggest intersection), but I'm trying to find the most efficient approach. Thank you.
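For concreteness, here is a minimal sketch that builds the example table and computes the target range; the underscored column names (`Meth_name`, `Global_name`) are an assumption, since attribute access doesn't work with spaces:

```python
import pandas as pd

# Sample data from the table above (underscored column names assumed)
df = pd.DataFrame({
    'Meth_name':   ['DTS', 'GGK', 'DTP', 'DS', 'PF1', 'PF2'],
    'Min':         [2600.0, 1800.0, 700.0, 700.0, 2800.0, 3000.0],
    'Max':         [3041.2, 3200.0, 3041.0, 3041.0, 3012.0, 3041.0],
    'Layer':       ['AC1', 'AC1', 'AC2', 'AC3', 'AC3', 'AC4'],
    'Global_name': ['DTS', 'DEN', 'DT', 'CALI', 'CALI', 'CALI'],
})

# The target range comes from the rows whose Global_name is unique:
non_dup = df[~df.Global_name.duplicated(keep=False)]
print(non_dup.Min.max(), non_dup.Max.min())  # 2600.0 3041.0
```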

Mr.Riply

2 Answers


If the order of the rows is not important, you can do the following:

# Keep, for each Global_name, the row with the widest [Min, Max] range
df['diff'] = df['Max'] - df['Min']
df = df.sort_values(['Global_name', 'diff'], ascending=True)
df = df.drop_duplicates('Global_name', keep='last').drop('diff', axis=1)

From this question
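A runnable sketch of this approach on the question's duplicated rows (underscored column names are an assumption); note it keeps the row with the widest `Max - Min` span per `Global_name`, which happens to coincide with the biggest-intersection row on this data:

```python
import pandas as pd

# The three duplicated CALI rows from the question (assumed column names)
df = pd.DataFrame({
    'Meth_name':   ['DS', 'PF1', 'PF2'],
    'Min':         [700.0, 2800.0, 3000.0],
    'Max':         [3041.0, 3012.0, 3041.0],
    'Global_name': ['CALI', 'CALI', 'CALI'],
})

df['diff'] = df['Max'] - df['Min']                    # range width per row
df = df.sort_values(['Global_name', 'diff'], ascending=True)
result = df.drop_duplicates('Global_name', keep='last').drop('diff', axis=1)
print(result.Meth_name.tolist())  # ['DS']
```

Whether the widest range always equals the biggest intersection with the target interval depends on the data, so this is a shortcut rather than a general solution.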

Mayeul sgc

Here is how I would go about it:

# Global names that occur more than once
counts = df.Global_name.value_counts()
dup_global_name = list(counts[counts > 1].index)

# Split into duplicated and non-duplicated parts
# (.copy() so the column assignment below avoids SettingWithCopyWarning)
df_dup = df[df.Global_name.isin(dup_global_name)].copy()
df_nondup = df[~df.Global_name.isin(dup_global_name)]

# Target range from the non-duplicated rows:
# [max of "Min", min of "Max"] -- here [2600.0, 3041.0]
max_of_min = df_nondup.Min.max()
min_of_max = df_nondup.Max.min()

# Helper function: length of the intersection of [x.Min, x.Max]
# with the target range; 0 if the intervals do not overlap
def calc_overlap(x):
    low = max(max_of_min, x.Min)
    high = min(min_of_max, x.Max)
    return max(high - low, 0)

# Add overlap column
df_dup['overlap'] = df_dup.apply(calc_overlap, axis=1)

# Select the row with the maximum overlap per Global_name
df_dup = df_dup.loc[df_dup.groupby('Global_name').overlap.idxmax()]

# Drop the helper column
df_dup = df_dup.drop('overlap', axis=1)

# Concatenate with the non-duplicated rows
result = pd.concat([df_nondup, df_dup])

This reproduces the desired output shown in the question.
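Run end to end against the question's sample data (column names with underscores assumed), this keeps exactly the DS row among the CALI duplicates:

```python
import pandas as pd

# Question's sample data (underscored column names assumed)
df = pd.DataFrame({
    'Meth_name':   ['DTS', 'GGK', 'DTP', 'DS', 'PF1', 'PF2'],
    'Min':         [2600.0, 1800.0, 700.0, 700.0, 2800.0, 3000.0],
    'Max':         [3041.2, 3200.0, 3041.0, 3041.0, 3012.0, 3041.0],
    'Layer':       ['AC1', 'AC1', 'AC2', 'AC3', 'AC3', 'AC4'],
    'Global_name': ['DTS', 'DEN', 'DT', 'CALI', 'CALI', 'CALI'],
})

dup = df.Global_name.duplicated(keep=False)
non_dup = df[~dup]
max_of_min, min_of_max = non_dup.Min.max(), non_dup.Max.min()  # 2600.0, 3041.0

def calc_overlap(x):
    # length of the intersection with [max_of_min, min_of_max], floored at 0
    return max(min(min_of_max, x.Max) - max(max_of_min, x.Min), 0)

df_dup = df[dup].copy()
df_dup['overlap'] = df_dup.apply(calc_overlap, axis=1)
best = df_dup.loc[df_dup.groupby('Global_name').overlap.idxmax()].drop('overlap', axis=1)
result = pd.concat([non_dup, best]).sort_index()
print(result.Meth_name.tolist())  # ['DTS', 'GGK', 'DTP', 'DS']
```

The overlaps come out to 441.0 for DS, 212.0 for PF1, and 41.0 for PF2, so `idxmax` selects DS.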

quest