
I have run into some trouble again. I have a file that looks like this:

chr1    142936580   142936581   209
chr1    142936581   142936582   208
chr1    142936582   142936583   212
chr1    142936583   142936584   210
chr1    142936588   142936590   215
chr1    142936590   142936591   217
chr1    142936591   142936592   221
chr1    142936592   142936593   220
chr1    145034453   145034454   222
chr1    145034454   145034455   220
chr1    145034455   145034456   218
chr1    145034456   145034457   215
chr1    145034457   145034459   216
chr1    145034459   145034460   212
chr1    161418656   161418657   178
chr1    161418657   161418658   177
chr1    161418658   161418659   179
chr2    90386745    90386747    222
chr2    90386747    90386748    221
chr2    90386748    90386750    220

The problem is that there are too many entries in my file, and I would like to reduce them to start:end intervals in the following way (at least that's the best I could think of): sort by the first column, then use only the second column and reduce it. By this I mean: if entries lie in the range starting with 142, keep the lowest and highest entries as the start and end positions. Then move on to the 145* positions and do the same. So basically I want to create start,end positions for those sets of entries that are visually apart from each other. We would end up with more or less:

chr1    142936580 142936592
chr1    145034453 145034459
chr1    161418656 161418658
chr2    90386745 90386748

That was my idea of how to do this. However, I am stuck on what code to use. Even suggestions are welcome. Thanks, Irek

Irek
  • If your question is how to merge the intervals, [this question](http://stackoverflow.com/q/5679638) might help you. Do you know how to read from and write to files? If you could show us what you have so far and tell us where exactly you're stuck, I'll be glad to help you. – flornquake Aug 27 '13 at 08:59
  • You should really provide some code to show us what you have tried. – Hans Then Aug 27 '13 at 09:04

2 Answers


If I understand correctly, you want to combine the successive intervals. My proposal:

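from csv import reader

LIMITINTER = 10  # largest gap between consecutive intervals that still get merged

with open("fichierin.txt") as f:
    read = reader(f, delimiter="\t")
    first = last = None
    for line in read:
        if last is None:
            first = last = line
        # same chromosome and close enough: extend the current interval
        elif line[0] == last[0] and abs(int(line[1]) - int(last[2])) < LIMITINTER:
            last = line
        else:
            # gap too large (or new chromosome): print the interval, start a new one
            print(last[0], first[1], last[2])
            first = last = line

    # print the final interval, unless the file was empty
    if last is not None:
        print(last[0], first[1], last[2])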

You will get:

chr1 142936580 142936593
chr1 145034453 145034460
chr1 161418656 161418659
chr2 90386745 90386750

You can put this in a function and yield the lines, or write them to another file, etc.

Edit: the minimum gap is now a constant (LIMITINTER).
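The suggestion to wrap the loop in a function that yields lines could look roughly like the sketch below. The names `merge_intervals` and `max_gap` are made up for illustration, and the sketch assumes the rows are already sorted by chromosome and then by start position, as in the question's file.

```python
def merge_intervals(rows, max_gap=10):
    """Yield (chrom, start, end) for runs of rows whose gaps stay below max_gap.

    rows: iterable of [chrom, start, end, ...] records, assumed to be
    sorted by chromosome and then by start position.
    """
    chrom = start = end = None
    for row in rows:
        if not row:
            continue  # skip blank lines
        row_chrom, row_start, row_end = row[0], int(row[1]), int(row[2])
        if chrom == row_chrom and row_start - end < max_gap:
            # close enough to the current run: extend it
            end = max(end, row_end)
        else:
            # new chromosome or gap too large: emit the finished run
            if chrom is not None:
                yield chrom, start, end
            chrom, start, end = row_chrom, row_start, row_end
    if chrom is not None:
        yield chrom, start, end


# A few rows in the same shape as the question's file (values are illustrative).
rows = [
    ["chr1", "100", "101", "209"],
    ["chr1", "101", "102", "208"],
    ["chr1", "500", "501", "222"],
    ["chr2", "500", "502", "221"],
]
for chrom, start, end in merge_intervals(rows, max_gap=10):
    print(chrom, start, end, sep="\t")
```

To read from a file instead, a `csv.reader(f, delimiter="\t")` object can be passed as `rows`, and a larger `max_gap` merges clusters that are further apart.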

Philippe T.
  • Yes, this is more or less what I wanted. However, the intervals are too small here. Start positions should differ more from each other: chr1 142936580 and 142936588 are still too close to each other. – Irek Aug 27 '13 at 10:21
  • With this edit, it is now configurable. – Philippe T. Aug 27 '13 at 11:39
  • I added /10000 in the same line where you made the change, and it also works nicely. Thanks o/ – Irek Aug 27 '13 at 12:03

You can loop through the file and keep track of the first and last number in a certain range. You can extract the range by converting the position to an integer and then dividing by a power of 10. Use a dictionary to store the lowest and highest values for each range.
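A minimal sketch of that idea follows. The function name `bucket_intervals` and the default `bucket_size` of 10**6 are assumptions chosen so that the 142* and 145* positions from the question land in different ranges; they are not part of the original answer.

```python
def bucket_intervals(rows, bucket_size=10**6):
    """Group rows by (chromosome, start // bucket_size) and keep min start / max end."""
    ranges = {}
    for chrom, start, end, *_ in rows:
        start, end = int(start), int(end)
        key = (chrom, start // bucket_size)  # integer division picks the range
        if key in ranges:
            lo, hi = ranges[key]
            ranges[key] = (min(lo, start), max(hi, end))
        else:
            ranges[key] = (start, end)
    # Keys sort by chromosome name (as text), then by bucket number.
    return [(chrom, lo, hi) for (chrom, _), (lo, hi) in sorted(ranges.items())]


# Rows taken from the question's sample data.
rows = [
    ("chr1", "142936580", "142936581", "209"),
    ("chr1", "142936592", "142936593", "220"),
    ("chr1", "145034453", "145034454", "222"),
    ("chr2", "90386745", "90386747", "222"),
]
for chrom, lo, hi in bucket_intervals(rows):
    print(chrom, lo, hi, sep="\t")
```

Unlike a gap-based merge, this does not need sorted input, but a cluster that straddles a multiple of `bucket_size` is split into two intervals. Chromosome names also sort as text here, so chr10 would come before chr2.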

Hans Then