How to cut 2nd and 3rd column out of a textfile? python

Question

I have a tab-delimited file with lines as such:

foo bar bar <tab>x y z<tab>a foo foo
...

Imagine 1,000,000 lines, with up to 200 words per line, each word averaging 5-6 characters.

To get the 2nd and 3rd columns, I can do this:

with open('test.txt','r') as infile:
  column23 = [i.split('\t')[1:3] for i in infile]

Or I could use Unix, as in "How can i get 2nd and third column in tab delim file in bash?":

import os
column23 = [i.split('\t') for i in os.popen('cut -f 2-3 test.txt').readlines()]

Which is faster? Is there any other way to extract the 2nd and 3rd column?

Why are you splitting in the last example? I **think** that cut will be faster, but you should run a benchmark with smaller test data. — Jasper, Apr 22 '14 at 14:24
Do you have a testfile we could use to see which solution is fastest? — Tim Pietzcker, Apr 22 '14 at 14:32

score 3 · Accepted Answer · answered Apr 22 '14 at 14:45

Use neither. Unless it proves to be too slow, use the csv module, which is far more readable.

import csv
with open('test.txt','r') as infile:
    column23 = [ cols[1:3] for cols in csv.reader(infile, delimiter="\t") ]
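
As a small illustration of what this returns, here is a sketch that runs the same comprehension on an in-memory sample instead of test.txt. The sample text and the helper name second_and_third are made up for the example. One side effect worth noting: csv.reader strips the line ending, so the third column comes back without a trailing newline.

```python
import csv
import io

def second_and_third(lines):
    """Return [col2, col3] for each tab-delimited line (hypothetical helper)."""
    return [cols[1:3] for cols in csv.reader(lines, delimiter="\t")]

# Made-up sample data standing in for test.txt
sample = io.StringIO("foo bar bar\tx y z\ta foo foo\nq\tr\ts\tt\n")
column23 = second_and_third(sample)
print(column23)  # [['x y z', 'a foo foo'], ['r', 's']]
```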

chepner


Tim Pietzcker · Answer 2 · 2014-04-22T14:31:54.613

If there can be hundreds of tab-delimited entries per line, and you only want the second and third, then you don't need to split all of them; there is a maxsplit parameter you can use that should speed things up:

with open('test.txt','r') as infile:
    column23 = [i.split('\t', 3)[1:3] for i in infile]
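
To make the effect of maxsplit concrete, here is a short sketch on a made-up line. With a limit of 3, str.split stops after the third tab and leaves the rest of the line unsplit in the last piece, which the [1:3] slice then discards. One caveat, as an observation rather than part of the answer: if a line has exactly three columns, the third one keeps its trailing newline unless you strip it first.

```python
# Made-up line with five tab-separated columns
line = "foo bar bar\tx y z\ta foo foo\tmore\teven more\n"

# Stop after 3 splits: the remainder of the line stays in the last piece
parts = line.split('\t', 3)
print(parts)     # ['foo bar bar', 'x y z', 'a foo foo', 'more\teven more\n']

column23 = parts[1:3]
print(column23)  # ['x y z', 'a foo foo']

# With exactly three columns, strip the newline before splitting
short = "a\tb\tc\n"
print(short.split('\t', 3)[1:3])               # ['b', 'c\n']
print(short.rstrip('\n').split('\t', 3)[1:3])  # ['b', 'c']
```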

And who knows, maybe a clever regex would even be faster:

import re
regex = re.compile("^[^\t\n]*\t([^\t\n]*)\t([^\t\n]*)", re.MULTILINE)
with open('test.txt','r') as infile:
    column23 = regex.findall(infile.read())
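
Whether the regex actually wins is something to measure rather than guess. Below is a rough timing sketch using timeit on generated data. The data shape (2,000 lines of 50 columns) and the function names are assumptions chosen for the example. It works on an in-memory string, so it leaves out file I/O and does not cover the cut approach. The numbers will vary by machine and Python version.

```python
import csv
import io
import re
import timeit

# Made-up test data: 2,000 lines of 50 tab-separated columns each
row = "\t".join("word%d" % n for n in range(50))
text = "\n".join([row] * 2000) + "\n"

regex = re.compile("^[^\t\n]*\t([^\t\n]*)\t([^\t\n]*)", re.MULTILINE)

def plain_split():
    return [i.split('\t')[1:3] for i in io.StringIO(text)]

def max_split():
    return [i.split('\t', 3)[1:3] for i in io.StringIO(text)]

def csv_reader():
    return [cols[1:3] for cols in csv.reader(io.StringIO(text), delimiter="\t")]

def regex_findall():
    # Returns a list of tuples rather than a list of lists
    return regex.findall(text)

candidates = [plain_split, max_split, csv_reader, regex_findall]

# Best of 3 runs, each run calling the function 5 times
timings = {f.__name__: min(timeit.repeat(f, number=5, repeat=3)) for f in candidates}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print("%-14s %.4f s" % (name, seconds))
```

For a fair comparison against cut, you would time the whole script on the real file instead, since most of the cost there is process startup and reading from disk.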
