
I have a big dataframe (~10 million rows). Each row has:

  • category
  • start position
  • end position

If two rows are in the same category and their start and end positions overlap within a ±5 tolerance, I want to keep just one of the rows. For example:

1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25

I want to filter out either row 1 or row 2.

What I'm doing right now isn't very efficient:

import pandas as pd

# column names chosen to match the example above
df = pd.read_csv('data.csv', sep='\t', header=None,
                 names=['id', 'category', 'start', 'end'])

params = {'min_distance': 5}  # the +-5 tolerance
dfs = {}  # one sub-dataframe per category
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

discard = []
rows = []
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    # rows of the same category whose start and end are both within tolerance
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance']) &
               (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.extend(res.index.values)
    rows.append(row)
df = pd.DataFrame(rows)

I've also tried a different approach making use of a sorted version of the dataframe.

my_index = 0
indexes = []
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[my_index]
    next_index = 1
    # walk forward past every consecutive row that overlaps the current one
    while my_index + next_index < total_len:
        second_row = df.iloc[my_index + next_index]
        c1 = (row.category == second_row.category)
        c2 = (abs(second_row.sstart - row.sstart) <= params['min_distance'])
        c3 = (abs(second_row.send - row.send) <= params['min_distance'])
        if not (c1 and c2 and c3):
            break
        next_index += 1
    indexes.append(my_index)  # keep only the first row of each run
    my_index += next_index
if my_index == total_len - 1:
    indexes.append(my_index)  # last row was not part of a run; keep it

The problem is that this solution is not perfect: sometimes it misses a row, because the overlapping row could be several rows ahead rather than the very next one.

I'm looking for any ideas on how to approach this problem in a more pandas-friendly way, if one exists.

user1532587
  • I think you must loop. If you had a row `cat1 17 26`, should it remain? It's within the tolerance of the second row, but you ultimately drop that row, and it's not within the tolerance of the first row, which you keep. Depending upon that, you may be able to merge – ALollz Apr 01 '19 at 13:10
  • that's a good point. We could get rid of any row transitively, just keeping the first – user1532587 Apr 01 '19 at 13:13

3 Answers


The approach here should be this:

  1. pandas.groupby by category
  2. agg(Func) on the groupby result
  3. Func should implement the logic of finding the best range inside each category (sorted search, balanced trees or anything else) – see the sketch below
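
A minimal sketch of that structure, using apply in place of agg since the function returns the surviving rows rather than a single value; the column names start/end and the ±5 tolerance are assumptions taken from the question:

import pandas as pd

TOLERANCE = 5  # the +-5 from the question

def keep_first_of_run(group):
    # Func: scan the category in sorted order and keep a row only when no
    # previously kept row is within the tolerance on both coordinates
    # ("transitive" dropping, as discussed in the comments on the question).
    kept_idx, kept_vals = [], []
    for idx, row in group.sort_values('start').iterrows():
        if not any(abs(row['start'] - s) <= TOLERANCE
                   and abs(row['end'] - e) <= TOLERANCE
                   for s, e in kept_vals):
            kept_idx.append(idx)
            kept_vals.append((row['start'], row['end']))
    return group.loc[kept_idx]

deduped = df.groupby('category', group_keys=False).apply(keep_first_of_run)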
Pavel Kovtun
  • I'm not too much of an expert with pandas, but I suspect `apply` might be more appropriate than `agg` here. The latter seems suited to applying aggregation functions to each series in a group, while the former applies a function to a dataframe and returns a dataframe. The OP doesn't want to find a single range from the group, but only to remove duplicates within it. – Christoph Burschka Apr 01 '19 at 13:45
  • there is not much difference between agg() and apply() in groupby operations in this situation – Pavel Kovtun Apr 01 '19 at 13:49

Do you want to merge all similar rows, or only two consecutive ones? If all similar, I suggest you first sort the rows by category, then by the two other columns, and squash the similar ones into a single row. If only two consecutive, check whether the next row is within the range you set and, if so, merge it. Here you can see how:

merge rows pandas dataframe based on condition
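
For the consecutive case, a rough sketch of the sort-and-compare idea using shift; the column names category/start/end and the ±5 tolerance are assumptions taken from the question:

import pandas as pd

tol = 5  # +-5 tolerance from the question

df = df.sort_values(['category', 'start', 'end'])
prev = df.shift(1)
same_cat = df['category'] == prev['category']
near = ((df['start'] - prev['start']).abs() <= tol) & \
       ((df['end'] - prev['end']).abs() <= tol)
# drop a row when it is a near-duplicate of the immediately preceding row
deduped = df[~(same_cat & near)]

Note that, like the sorted attempt in the question, this only compares adjacent rows after sorting, so a match that sits several rows ahead is not caught.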

Petronella

I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:

dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only that category's rows, and the outputs will be combined back into a single dataframe. The combined dataframe will have a MultiIndex with the value of "category" as an outer level; this is what the droplevel(0) removes.

Secondly, within the category you could sort by the first of the two numeric columns for another small speed-up:

def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...

This will allow you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
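
Putting the two pieces together, a sketch of what drop_duplicates could look like; the sstart/send column names and params['min_distance'] come from the question, the rest is illustrative:

def drop_duplicates(group):
    # sort so the scan over previously kept rows can stop early
    group = group.sort_values('sstart')
    starts = group['sstart'].to_numpy()
    ends = group['send'].to_numpy()
    tol = params['min_distance']
    keep = []  # positions (within the sorted group) of rows kept so far
    for i in range(len(group)):
        is_dup = False
        for j in reversed(keep):
            if starts[i] - starts[j] > tol:
                break  # sorted: every earlier kept row starts further away
            if abs(ends[i] - ends[j]) <= tol:
                is_dup = True
                break
        if not is_dup:
            keep.append(i)
    return group.iloc[keep]

result = df.groupby('category').apply(drop_duplicates).droplevel(0)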

Christoph Burschka