+------------+-----+--------+-----+-------------+
| Meth.name  |  Min| Max    |Layer| Global name |
+------------+-----+--------+-----+-------------+
|   DTS      | 2600| 3041.2 | AC1 |  DTS        |
|   GGK      | 1800| 3200.0 | AC1 |  DEN        |
|   DTP      | 700 | 3041.0 | AC2 |  DT         |
|   DS       | 700 | 3041.0 | AC3 |  CALI       |
|   PF1      | 2800| 3012.0 | AC3 |  CALI       |
|   PF2      | 3000| 3041.0 | AC4 |  CALI       |
+------------+-----+--------+-----+-------------+

We have to drop rows with duplicated values in the "Global name" column, but in a specific way: among the duplicates we want to keep the row that gives the biggest intersection with the range formed by the maximum of the "Min" column and the minimum of the "Max" column over the non-duplicated rows. In the example above this range is [2600.0, 3041.0], so we want to keep only the row with 'Meth.name' == 'DS', and the overall result should look like:

+------------+-----+--------+-----+-------------+
| Meth.name  |  Min| Max    |Layer| Global name |
+------------+-----+--------+-----+-------------+
|   DTS      | 2600| 3041.2 | AC1 |  DTS        |
|   GGK      | 1800| 3200.0 | AC1 |  DEN        |
|   DTP      | 700 | 3041.0 | AC2 |  DT         |
|   DS       | 700 | 3041.0 | AC3 |  CALI       |
+------------+-----+--------+-----+-------------+

This problem can, of course, be solved in several iterations (calculate the interval from the non-duplicated rows, then iteratively keep only those duplicated rows that give the biggest intersection), but I'm trying to find the most efficient approach. Thank you.
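For concreteness, here is a minimal sketch that builds the example table and computes the target range; the underscored column names (`Meth_name`, `Global_name`) are an assumption, since attribute access doesn't work with spaces:

```python
import pandas as pd

# Sample data from the table above (underscored column names assumed)
df = pd.DataFrame({
    'Meth_name':   ['DTS', 'GGK', 'DTP', 'DS', 'PF1', 'PF2'],
    'Min':         [2600.0, 1800.0, 700.0, 700.0, 2800.0, 3000.0],
    'Max':         [3041.2, 3200.0, 3041.0, 3041.0, 3012.0, 3041.0],
    'Layer':       ['AC1', 'AC1', 'AC2', 'AC3', 'AC3', 'AC4'],
    'Global_name': ['DTS', 'DEN', 'DT', 'CALI', 'CALI', 'CALI'],
})

# The target range comes from the rows whose Global_name is unique:
non_dup = df[~df.Global_name.duplicated(keep=False)]
print(non_dup.Min.max(), non_dup.Max.min())  # 2600.0 3041.0
```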

Mr.Riply

2 Answers


If the order of the rows is not important, you can do the following:

# Keep, for each Global_name, the row with the widest [Min, Max] range
df['diff'] = df['Max'] - df['Min']
df = df.sort_values(['Global_name', 'diff'], ascending=True)
df = df.drop_duplicates('Global_name', keep='last').drop('diff', axis=1)

From this question
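A runnable sketch of this approach on the question's duplicated rows (underscored column names are an assumption); note it keeps the row with the widest `Max - Min` span per `Global_name`, which happens to coincide with the biggest-intersection row on this data:

```python
import pandas as pd

# The three duplicated CALI rows from the question (assumed column names)
df = pd.DataFrame({
    'Meth_name':   ['DS', 'PF1', 'PF2'],
    'Min':         [700.0, 2800.0, 3000.0],
    'Max':         [3041.0, 3012.0, 3041.0],
    'Global_name': ['CALI', 'CALI', 'CALI'],
})

df['diff'] = df['Max'] - df['Min']                    # range width per row
df = df.sort_values(['Global_name', 'diff'], ascending=True)
result = df.drop_duplicates('Global_name', keep='last').drop('diff', axis=1)
print(result.Meth_name.tolist())  # ['DS']
```

Whether the widest range always equals the biggest intersection with the target interval depends on the data, so this is a shortcut rather than a general solution.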

Mayeul sgc

Here is how I would go about it:

# Global names that occur more than once
counts = df.Global_name.value_counts()
dup_global_name = list(counts[counts > 1].index)

# Split into duplicated and non-duplicated parts
# (.copy() so the column assignment below avoids SettingWithCopyWarning)
df_dup = df[df.Global_name.isin(dup_global_name)].copy()
df_nondup = df[~df.Global_name.isin(dup_global_name)]

# Target range from the non-duplicated rows:
# [max of "Min", min of "Max"] -- here [2600.0, 3041.0]
max_of_min = df_nondup.Min.max()
min_of_max = df_nondup.Max.min()

# Helper function: length of the intersection of [x.Min, x.Max]
# with the target range; 0 if the intervals do not overlap
def calc_overlap(x):
    low = max(max_of_min, x.Min)
    high = min(min_of_max, x.Max)
    return max(high - low, 0)

# Add overlap column
df_dup['overlap'] = df_dup.apply(calc_overlap, axis=1)

# Select the row with the maximum overlap per Global_name
df_dup = df_dup.loc[df_dup.groupby('Global_name').overlap.idxmax()]

# Drop the helper column
df_dup = df_dup.drop('overlap', axis=1)

# Concatenate with the non-duplicated rows
result = pd.concat([df_nondup, df_dup])

This reproduces the desired output shown in the question.
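Run end to end against the question's sample data (column names with underscores assumed), this keeps exactly the DS row among the CALI duplicates:

```python
import pandas as pd

# Question's sample data (underscored column names assumed)
df = pd.DataFrame({
    'Meth_name':   ['DTS', 'GGK', 'DTP', 'DS', 'PF1', 'PF2'],
    'Min':         [2600.0, 1800.0, 700.0, 700.0, 2800.0, 3000.0],
    'Max':         [3041.2, 3200.0, 3041.0, 3041.0, 3012.0, 3041.0],
    'Layer':       ['AC1', 'AC1', 'AC2', 'AC3', 'AC3', 'AC4'],
    'Global_name': ['DTS', 'DEN', 'DT', 'CALI', 'CALI', 'CALI'],
})

dup = df.Global_name.duplicated(keep=False)
non_dup = df[~dup]
max_of_min, min_of_max = non_dup.Min.max(), non_dup.Max.min()  # 2600.0, 3041.0

def calc_overlap(x):
    # length of the intersection with [max_of_min, min_of_max], floored at 0
    return max(min(min_of_max, x.Max) - max(max_of_min, x.Min), 0)

df_dup = df[dup].copy()
df_dup['overlap'] = df_dup.apply(calc_overlap, axis=1)
best = df_dup.loc[df_dup.groupby('Global_name').overlap.idxmax()].drop('overlap', axis=1)
result = pd.concat([non_dup, best]).sort_index()
print(result.Meth_name.tolist())  # ['DTS', 'GGK', 'DTP', 'DS']
```

The overlaps come out to 441.0 for DS, 212.0 for PF1, and 41.0 for PF2, so `idxmax` selects DS.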

quest