
I have a very large TSV file and need to delete several columns. I've found the csv module, and an answer to a somewhat similar question (see the script below). However, I need to delete a large range of columns and don't want to type every single index of each column to delete. I.e., from a file with 689513 columns, I'd like to remove columns 628715 through 650181 and also columns 653321 through 689513. (If removing both sets is too hard, I can go with removing just the last set, i.e., 653321 through 689513, or equivalently 653321 to the end of the file.) Sorry for the basic question; I'm new to scripting and getting lost, and the csv module documentation doesn't go into detail on deleting column ranges. I tried doing this in R, but the first cell entry is blank (see the sample below the code). My file is tab-delimited, but I gather that can be rectified by setting the delimiter to \t. Any help is greatly appreciated! (Note: unfortunately I need to have colons in the names of my columns, i.e., 2L:1274 is, altogether, the name of one column.)

import csv
with open("source", "rb") as source:
    rdr = csv.reader(source)
    with open("result", "wb") as result:
        wtr = csv.writer(result)
        for r in rdr:
            wtr.writerow((r[0], r[1], r[3], r[4]))

2L:1274 2L:2425 2L:2853 3L:4    3L:5    3L:7
indivBCsusceptiblePL7A10_TATAGT NA  NA  NA  NA  NA  NA
indivBCsusceptiblePL7A11_CCTGAA NA  5   NA  NA  NA  NA
indivBCsusceptiblePL7A12_CAATAT NA  NA  6   7   8   9
indivBCsusceptiblePL7A1_CCGAAT  NA  NA  NA  NA  NA  NA
Brian Tompsett - 汤莱恩
  • There's a package that is designed to do just this, but it's escaping me at the moment. I'll post as an answer if I can find it. – Ari B. Friedman Oct 02 '12 at 15:10
  • I think the `colClasses` argument in R's `read.table` lets you skip reading in certain columns. – Blue Magister Oct 02 '12 at 15:17
  • 1
    @AriB.Friedman is thinking of the 'colbycol' package. I would read the first line in, prepend a first column name to that character vector and then use `colClasses` with `rep("NULL, ...)` in the positions to be dumped as BlueMagister suggested. You will need to set `check.names=FALSE` to preserve the colons – IRTFM Oct 02 '12 at 15:29

5 Answers


You can use del to delete slices of a list. Note that Python indices are 0-based (1-based column C is index C-1) and slice ends are exclusive:

import csv

with open('in.tsv', 'r') as fin, open('out.tsv', 'w') as fout:
    reader = csv.reader(fin, dialect='excel-tab')
    writer = csv.writer(fout, dialect='excel-tab')
    for row in reader:
        # delete slices in reverse order to avoid shifting earlier indices
        del row[653320:689513]  # 1-based columns 653321 through 689513
        del row[628714:650181]  # 1-based columns 628715 through 650181
        writer.writerow(row)
nneonneo

You can do this with very little memory using Python.

First define a dialect describing your tsv format. See the documentation on dialects for more information.

import csv

class TsvDialect(csv.Dialect):
    delimiter = '\t'
    lineterminator = '\n'  # the csv module requires this to be set when writing
    quoting = csv.QUOTE_NONE
    escapechar = None

# you can just pass this class around, or you can register it under a name
csv.register_dialect('tsv', TsvDialect)

Then you can walk through each line and copy to a new tsv:

with open('source.tsv', 'rb') as src, open('result.tsv', 'wb') as res:
    csrc = csv.reader(src, dialect='tsv')
    cres = csv.writer(res, dialect='tsv')
    for row in csrc:
        cres.writerow(row)

This does a simple copy. Since you only want some columns, let's copy only those.

Python's lists are zero-indexed (the first column is column 0, not column 1); and index slicing does not include the last item (wholelist[:2] is the same as [wholelist[0], wholelist[1]]). Keep these in mind to avoid off-by-one errors!
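As a quick sanity check of those two rules (a toy list, not the poster's data):

```python
wholelist = ['a', 'b', 'c', 'd', 'e']

# Zero-indexed: the first column is wholelist[0].
print(wholelist[0])     # 'a'

# Slice ends are exclusive: [:2] keeps indices 0 and 1 only,
# and [2:4] keeps indices 2 and 3, not index 4.
print(wholelist[:2])    # ['a', 'b']
print(wholelist[2:4])   # ['c', 'd']
```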

with open('source.tsv', 'rb') as src, open('result.tsv', 'wb') as res:
    csrc = csv.reader(src, dialect='tsv')
    cres = csv.writer(res, dialect='tsv')
    for row in csrc:
        # keep 1-based columns 1-628714 and 650182-653320, drop the rest
        newrow = row[:628714]              # 0-based indices 0 through 628713
        newrow.extend(row[650181:653320])  # 0-based indices 650181 through 653319
        cres.writerow(newrow)

Alternatively, instead of copying the columns you want to a new row, you can save some memory at the expense of code clarity by deleting the columns you don't want:

    for row in csrc:
        # drop 1-based columns 628715-650181 and 653321-689513
        # be sure to delete the later range first!
        del row[653320:689513]
        del row[628714:650181]
        cres.writerow(row)

You can abstract column cutting (either method, using any indexing you're comfortable with) into a function if you need to do this very often.
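One way to sketch that abstraction (the function name and the (start, stop) convention are my own choices, not from the question):

```python
def drop_column_ranges(row, ranges):
    """Return a copy of row without the given 0-based, end-exclusive
    (start, stop) column ranges. The ranges need not be sorted."""
    row = list(row)
    # Delete from the end backwards so earlier indices are not shifted.
    for start, stop in sorted(ranges, reverse=True):
        del row[start:stop]
    return row

# Toy example: drop indices 1-2 and 5-6 from an 8-column row.
print(drop_column_ranges(list('abcdefgh'), [(1, 3), (5, 7)]))
# ['a', 'd', 'e', 'h']
```

For the question's 1-based columns, the call would be `drop_column_ranges(row, [(628714, 650181), (653320, 689513)])`.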

You might also want to take a look at the csvkit python library and command-line tools, in particular its command-line tool csvcut, which appears to do exactly what you want from the command line.

Francis Avila

With 2 GB RAM or more, it should be possible to load the dataset in memory, delete the columns you want, and write the contents to a file. This could either be done in R or python easily. For R:

dat = read.table("spam.tsv", ...)
dat = dat[-c(1,5)] # delete columns 1 and 5
write.csv(dat, ....)

Doing this in chunks can easily be done using either an apply loop or a for loop. I use the apply style:

read_chunk = function(chunk_index, chunk_size, fname) {
    dat = read.table(fname, nrow = chunk_size, skip = (chunk_index - 1) * chunk_size, ...)
    dat = dat[-c(1,5)] # delete columns 1 and 5
    write.csv(dat, append = TRUE, ....)    
}

tot_no_lines = 10000 # for example
chunk_size = 1000
sapply(1:(tot_no_lines / chunk_size), read_chunk, chunk_size = chunk_size, fname = "spam.tsv")

Note that this is R-style code meant as inspiration, not working R code.

Paul Hiemstra

You can build the output row dynamically:

for r in rdr:
    outrow = []
    for i in range(0, 628714):
       outrow.append(r[i])
    for i in range(650181, 653320):
       outrow.append(r[i])
    wtr.writerow( outrow )

I imagine you can do this even more concisely with slices of the input row r, along the lines of:

 outrow = r[0:628714]
 outrow.extend(r[650181:653320])
 wtr.writerow( outrow )

Perhaps not the fastest to execute, but certainly easier to write.

Larry Lustig
  • Worth noting that on Python 2.x, `range(0, 628714)` is going to be *very* slow, `xrange` should be used instead. – nneonneo Oct 02 '12 at 16:27
  • Also worth noting: `outrow.extend(r[:628714]); outrow.extend(r[650181:653320])` is probably a faster and more concise way to express this operation. – nneonneo Oct 02 '12 at 16:28
  • Using this (building output row dynamically) or the post below with multi-splice, I receive the error message: "_csv.Error: field larger than field limit (131072)." I run into problems using R because my first cell is empty (and I think I need it to remain blank for a subsequent Python script), which seems to result in R shifting everything over by one and misaligning headers/contents -- which is why I was trying python. Any further suggestions? – user1714643 Oct 02 '12 at 16:35
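Regarding the "field larger than field limit (131072)" error in the last comment: the csv module caps individual fields at 128 KB by default, and the cap can be raised with the standard-library call csv.field_size_limit before reading. (On some platforms passing sys.maxsize can overflow a C long; a smaller value such as 2**31 - 1 works there.)

```python
import csv
import sys

# Raise the per-field size cap from the default 131072 bytes.
csv.field_size_limit(sys.maxsize)

# Subsequent csv.reader calls can now handle very large fields.
```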

Are you on Linux? Then save the hassle and use csvtool from the shell:

 csvtool col 1-500,502-1000 input.csv > output.csv

You can also set delimiter and so on, just type csvtool --help. Quite easy to use.

Michael