I have a tab-separated .txt file such as:

1 2345
1 2346
1 2347
1 2348
1 2412
1 2413
1 2414

The first four consecutive lines contain the consecutive integer values 2345 through 2348. Similarly, the last three lines contain the consecutive values 2412 through 2414. I want to group them such that the minimum and maximum of these sets of consecutive values appear on a single line as shown below:

1 2345 2348
1 2412 2414

Any idea?

  • Sorry, how is grouping determined? – Martijn Pieters Mar 19 '13 at 16:38
  • Welcome to Stack Overflow. Non-specific questions such as this one don't generally receive high-quality answers here. Tell us what you have tried. What, specifically, didn't work, and what specific question do you have? – Robᵩ Mar 19 '13 at 16:39

3 Answers

You could use a slightly modified version of Raymond Hettinger's cluster function for this:

def cluster(data, maxgap):
    """Arrange data into groups where successive elements
       differ by no more than *maxgap*

        >>> cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
        [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

        >>> cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
        [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]

    https://stackoverflow.com/a/14783998/190597 (Raymond Hettinger)
    """
    groups = [[data[0]]]
    for x in data[1:]:
        if abs(x - groups[-1][-1]) <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

data = []
with open('data.txt', 'r') as f:
    for line in f:
        _, num = line.split()
        data.append(int(num))
for row in cluster(data, 1):
    print('1 {s} {e}'.format(s=row[0], e=row[-1]))

yields

1 2345 2348
1 2412 2414
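
Note that cluster(data, 1) compares absolute differences, so repeated values and values that decrease by one would also be merged. If only strictly increasing consecutive values should be grouped, and the first column should be kept rather than hard-coded, a minimal sketch could look like the following. The name collapse_runs and the tuple layout are illustrative assumptions, not part of the original function:

```python
def collapse_runs(rows):
    """Collapse runs of (key, value) rows into (key, first, last) tuples.

    A run continues only while the key is unchanged and the value is
    exactly one more than the previous value. (Hypothetical helper.)
    """
    runs = []
    for key, value in rows:
        if runs and runs[-1][0] == key and value == runs[-1][2] + 1:
            runs[-1][2] = value               # extend the current run
        else:
            runs.append([key, value, value])  # start a new run
    return [tuple(run) for run in runs]


rows = [(1, 2345), (1, 2346), (1, 2347), (1, 2348),
        (1, 2412), (1, 2413), (1, 2414)]
for key, first, last in collapse_runs(rows):
    print(key, first, last)
```

For the sample data this prints the same two lines as above.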
unutbu
  • That won't work; I need to group only consecutive positions. My first approach: `line = file1.readline() parts = line.split("\t") chrom = int (parts[0]) posit = int (parts[1]) start = posit intervals = [] while line != "" : line = file1.readline() parts = line.split("\t") while (parts[0] == chrom) and (parts[1] == posit+1): posit = parts[1] line = file1.readline() end = posit intervals.append((chrom,start,end)) chrom = int (parts[0]) posit = int (parts[1]) start = posit line = file1.readline() parts = line.split("\t")` – user2187471 Mar 19 '13 at 16:45
  • Perhaps you need to clarify what you mean by consecutive position? – unutbu Mar 19 '13 at 16:48
  • @unutbu: You sorted the input data; but the input order is important. Only consecutive groups should be collapsed. :-) – Martijn Pieters Mar 19 '13 at 16:54
  • @Martijn: Thanks. That is easily sorted out. :) – unutbu Mar 19 '13 at 17:01
Read and write the data with the csv module, and keep track of when the 'next' group starts:

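import csv

def grouped(reader):
    # Yield (first_row, last_row) for each run of rows whose second
    # column increases by exactly 1 from one row to the next.
    start = end = next(reader, None)
    if start is None:
        return  # empty input: nothing to group
    for row in reader:
        if int(row[1]) - 1 != int(end[1]):
            yield (start, end)
            start = end = row
        else:
            end = row
    yield (start, end)

# Python 3: open csv files in text mode with newline=''
# (on Python 2, use modes 'rb' and 'wb' instead).
with open('inputfile.csv', newline='') as inf, open('outputfile.csv', 'w', newline='') as outf:
    inputcsv = csv.reader(inf, delimiter='\t')
    outputcsv = csv.writer(outf, delimiter='\t')
    for start, stop in grouped(inputcsv):
        # start is [chrom, first]; stop[1:] appends the last value
        outputcsv.writerow(start + stop[1:])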

This writes:

1   2345    2348
1   2412    2414

to outputfile.csv for your input.

This solution never keeps more than 3 rows of data in memory, so you should be able to throw any size of CSV file at it.
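
The same streaming idea can also be sketched with itertools.groupby, using the observation that within a run of consecutive values, the value minus the row index stays constant. This is an alternative technique rather than the approach above; the name runs is a hypothetical helper, and io.StringIO objects stand in for the real files so the example is self-contained:

```python
import csv
import io
from itertools import groupby


def runs(rows):
    """Lazily yield (chrom, first, last) for each run of consecutive positions."""
    # Within a run, position minus row index is constant, so adjacent rows
    # with the same (chrom, position - index) key belong to the same run.
    def key(pair):
        index, row = pair
        return (row[0], int(row[1]) - index)

    for (chrom, _), group in groupby(enumerate(rows), key=key):
        _, first_row = next(group)
        last_row = first_row
        for _, last_row in group:  # advance to the final row of the run
            pass
        yield chrom, first_row[1], last_row[1]


# In-memory stand-ins for the input and output files (assumption for the demo).
source = io.StringIO("1\t2345\n1\t2346\n1\t2347\n1\t2348\n1\t2412\n1\t2413\n1\t2414\n")
sink = io.StringIO()
writer = csv.writer(sink, delimiter="\t", lineterminator="\n")
writer.writerows(runs(csv.reader(source, delimiter="\t")))
print(sink.getvalue(), end="")
```

Because groupby only groups adjacent rows and the generator holds just the first and last row of the current run, memory use stays constant here as well. Unlike the function above, this variant also starts a new group when the first column changes.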

Martijn Pieters
numpy provides some tools that could help:

In [90]: import numpy as np

In [91]: x = np.loadtxt('seq.dat', dtype=int)

In [92]: x
Out[92]: 
array([[   1, 2345],
       [   1, 2346],
       [   1, 2347],
       [   1, 2348],
       [   1, 2412],
       [   1, 2413],
       [   1, 2414],
       [   1, 2500],
       [   1, 2501],
       [   1, 2502],
       [   2, 3000],
       [   2, 3001],
       [   2, 3100],
       [   2, 3101],
       [   2, 3102],
       [   2, 3103]])

In [93]: skip = np.where(np.diff(x[:,1]) != 1)[0]

In [94]: istart = np.hstack((0, skip + 1))

In [95]: istop = np.hstack((skip, -1))

In [96]: groups = np.hstack((x[istart], x[istop, 1:]))

In [97]: groups
Out[97]: 
array([[   1, 2345, 2348],
       [   1, 2412, 2414],
       [   1, 2500, 2502],
       [   2, 3000, 3001],
       [   2, 3100, 3103]])

The first column of data is ignored when grouping, so this will need some tweaking if the first column can affect how the groups are formed.
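
One possible tweak, sketched here as an assumption about what "affect how the groups are formed" should mean, is to treat a change in the first column as a group boundary too. The sample array below is made up for illustration, and the code assumes x has at least one row:

```python
import numpy as np

# Made-up sample: the first column changes in the middle of a numeric run.
x = np.array([[1, 2345], [1, 2346], [1, 2347],
              [2, 2348], [2, 2349],
              [2, 3100], [2, 3101]])

# A break occurs wherever the second column does not increase by exactly 1
# or the first column changes between neighbouring rows.
is_break = (np.diff(x[:, 1]) != 1) | (np.diff(x[:, 0]) != 0)
skip = np.where(is_break)[0]

istart = np.hstack((0, skip + 1))       # first row index of each group
istop = np.hstack((skip, len(x) - 1))   # last row index of each group
groups = np.hstack((x[istart], x[istop, 1:]))
print(groups)
```

Each row of groups holds the first-column value, the first position, and the last position of one run, so the numeric run 2345 to 2349 is split where the first column changes from 1 to 2.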

Warren Weckesser