My data frame looks like this:

NC_009925.1     NC_009925.1     100.00  5356    0       0       5635975 5641330 1410850 1405495 0.0      9891
NC_009925.1     NC_009925.1     100.00  5356    0       0       1405495 1410890 5641330 5635975 0.0      9850
NC_009925.1     NC_009925.1     99.57   2788    12      0       3711607 3714394 1346122 1343335 0.0      5083
NC_009925.1     NC_009925.1     99.57   2788    12      0       1343335 1346122 3714394 3711659 0.0      5037

The 7th and 8th columns represent one range (Range1), and the 9th and 10th columns represent a second range (Range2). I'd like to remove every row whose Range1 overlaps the Range2 of ANY row. Among overlapping rows, the one to retain is the row with the highest value in the rightmost column. So the output would look like this:

NC_009925.1     NC_009925.1     100.00  5356    0       0       5635975 5641330 1410850 1405495 0.0      9891
NC_009925.1     NC_009925.1     99.57   2788    12      0       3711607 3714394 1346122 1343335 0.0      5083
user3654634
  • It is hard to see what is col 7,8,9,10. Presumably whitespace separates, but in some cases you have much more space than others. Is one space a column break, too? It looks as though there are two columns after these ranges. – ako Oct 23 '15 at 04:25
  • With overlap, do you mean *completely* overlap, or is partial overlap already enough to remove a row? –  Oct 23 '15 at 04:26
  • Are the values in columns 7 and 8 increasing? (Ranges can be decreasing, but is that the case anywhere in your case?) Ditto for columns 9 and 10. –  Oct 23 '15 at 04:27
  • What have you tried so far? –  Oct 23 '15 at 04:28
  • [this](http://stackoverflow.com/questions/33264676/pandas-combining-rows-based-on-dates/33265606) question is similar in spirit (i.e. deals with overlapping ranges) – jakevdp Oct 23 '15 at 04:29
  • These columns are all separated by tabs, although a couple look like spaces (apologies on the formatting). As for overlaps, I'm looking to remove rows with partial overlaps. The values for each range can either increase or decrease. – user3654634 Oct 23 '15 at 06:20
  • For the most part, I'm having difficulty figuring out how I can compare a Range1 at row x to all Range2's. I've figured out how to use intersection to determine how much two ranges overlap, and I figure using the length of the intersection output would be a good filter to remove overlapping ranges. However, I've only been able to get ranges compared within the same row. In addition, I'm also wondering what's a good way to create these intersection values for filtering, but keep the rest of the values in the rows. – user3654634 Oct 23 '15 at 06:27

1 Answer

It would help if you gave your column names.

Is your second range defined as running from column 10 to column 9, given that the values in column 9 are greater than those in column 10?

Is your example correct? As I understand it, the interval on the first line of your output overlaps the interval on the second line of your input, and your second output line overlaps the fourth line of your input.

If I am interpreting you correctly, you could use an interval tree. The intervaltree package (https://github.com/chaimleib/intervaltree) can be installed with pip. You could then use:

from io import StringIO

import pandas as pd
from intervaltree import IntervalTree


# Columns g/h hold Range1 and columns i/k hold Range2 (there is no column j).
df = pd.read_csv(StringIO('''a b c d e f g h i k l m
NC_009925.1     NC_009925.1     100.00  5356    0       0       5635975 5641330 1410850 1405495 0.0      9891
NC_009925.1     NC_009925.1     100.00  5356    0       0       1405495 1410890 5641330 5635975 0.0      9850
NC_009925.1     NC_009925.1     99.57   2788    12      0       3711607 3714394 1346122 1343335 0.0      5083
NC_009925.1     NC_009925.1     99.57   2788    12      0       1343335 1346122 3714394 3711659 0.0      5037
NC_009925.1     NC_009925.1     99.57   2788    12      0       943335 946122 3714394 3711659 0.0      5037'''), sep=r'\s+')

# Build a tree of every Range2 interval. Intervals are half-open, hence the +1.
# This assumes k <= i in every row, as in the sample data above.
range2 = IntervalTree.from_tuples(zip(df['k'], df['i'] + 1))

# Flag rows whose Range1 start or end point falls inside any Range2 interval.
df['start_overlaps'] = df['g'].apply(lambda x: range2.overlaps(x))
df['end_overlaps'] = df['h'].apply(lambda x: range2.overlaps(x))

df['overlaps'] = df.start_overlaps | df.end_overlaps

# Keep only the rows with no overlap.
df = df[~df.overlaps]
print(df)

I added a fifth row because every interval in your example overlaps another one.
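The filter above drops every overlapping row, so it does not apply the "keep the highest value in the rightmost column" rule from the question. One possible reading of that rule is a greedy pass: visit rows from highest to lowest score and keep a row only if its Range1 does not overlap the Range2 of a row already kept. The sketch below uses only the standard library; the helper names (`ordered`, `ranges_overlap`, `keep_best_hits`) are made up for illustration, ranges are treated as closed, and the rows are reduced to the four range columns plus the score.

```python
def ordered(a, b):
    """Return the endpoints as (low, high), whichever order they came in."""
    return (a, b) if a <= b else (b, a)


def ranges_overlap(r1, r2):
    """True if two closed (low, high) ranges share at least one position."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]


def keep_best_hits(rows):
    """rows: (range1_start, range1_end, range2_start, range2_end, score) tuples.

    Assumed rule: highest score wins; a row is dropped when its Range1
    overlaps the Range2 of any row that has already been kept.
    """
    kept = []
    for row in sorted(rows, key=lambda r: r[4], reverse=True):
        range1 = ordered(row[0], row[1])
        clashes = any(ranges_overlap(range1, ordered(k[2], k[3])) for k in kept)
        if not clashes:
            kept.append(row)
    return kept


# The four rows from the question: columns 7-10 and the rightmost column.
rows = [
    (5635975, 5641330, 1410850, 1405495, 9891),
    (1405495, 1410890, 5641330, 5635975, 9850),
    (3711607, 3714394, 1346122, 1343335, 5083),
    (1343335, 1346122, 3714394, 3711659, 5037),
]
best = keep_best_hits(rows)
for row in best:
    print(row)
```

On the question's sample this keeps the rows scoring 9891 and 5083, which matches the desired output shown there. The scan over `kept` is quadratic in the worst case; for a large table the same check could be done against an interval tree that grows as rows are kept.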

matt_s
  • When I do the IntervalTree.from_tuples line, I get an error saying: ValueError: IntervalTree: Null Interval objects not allowed in IntervalTree: Interval(3975158, 3974200). I'm assuming this is because columns 9 and 10 interchange which contains the greater value. – user3654634 Oct 23 '15 at 19:38
  • @user3654634 Yes, it needs them ordered as (low, high). You could create two new columns with them in the right order. – matt_s Oct 23 '15 at 20:32
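The reordering suggested in the last comment can be sketched without any third-party packages. The helper name `half_open` is hypothetical, and the third pair is reconstructed from the values in the error message (column 9 below column 10). Each pair is sorted into (low, high), and 1 is added to the upper bound because intervaltree treats the end of an interval as exclusive.

```python
def half_open(a, b):
    """Sort two endpoints and return a half-open (low, high + 1) tuple."""
    low, high = min(a, b), max(a, b)
    return (low, high + 1)


# (column 9, column 10) pairs; either column may hold the larger value.
pairs = [
    (1410850, 1405495),
    (5641330, 5635975),
    (3974199, 3975158),  # order reversed, as in the reported error
]
intervals = [half_open(a, b) for a, b in pairs]
for interval in intervals:
    print(interval)
```

Tuples built this way always have a begin below their end, so they can be passed to `IntervalTree.from_tuples` without triggering the null-interval error. In a DataFrame the same two values per row could come from `df[['i', 'k']].min(axis=1)` and `df[['i', 'k']].max(axis=1)`.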