parsing large dataset with python

Question

I have a large matrix in a gzip that looks something like this:

locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0,0.5536,0.9177,0.2929,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0,0.9536,0.8177,0.2827,0.0,0.0

So, each row starts with two descriptors, followed by 10 values.

I simply want to keep the first 5 values of each row, so that I end up with a matrix like this:

locus_1 mark1 0.0,0.0,0.0,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0

I have made the following python script to parse this, but to no avail:

import gzip
import numpy as np

inFile = gzip.open('/home/anish/data.gz')

inFile.next()

for line in inFile:
        cols = line.strip().replace('nan','0').split('\t')
        data = cols[2:]
        data = map(float,data)

        gfpVals =  data[:5]

        print '\t'.join(cols[:6]) + '\t' + '\t'.join(map(str,gfpVals))

I simply get the error:

data = map(float,data)
ValueError: could not convert string to float:

Apologies, that's not the error. The error is: data = map(float,data) ValueError: invalid literal for float(): — Alex Trevylan, Mar 10 '17 at 18:40
If you thought my response answered your question I would appreciate it if you accepted it as the answer, so that this question does not keep appearing as unanswered. — Giannis Spiliopoulos, Mar 16 '17 at 17:37
[What does your step debugger tell you?](http://stackoverflow.com/questions/25385173/what-is-a-debugger-and-how-can-it-help-me-diagnose-problems) — , Mar 24 '17 at 18:31

score 2 · Accepted Answer · answered Mar 10 '17 at 18:54

You are splitting each line only on tabs, but the values themselves are separated by commas.

As a result

locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0

is split to

locus_1 || mark1 || 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0

and you are passing to float the string

"0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0"

which is an invalid literal.
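The failure can be reproduced on the first sample row with a few lines. This is a minimal Python 3 sketch that assumes the columns are tab-separated, as the split in the question implies; the exact wording of the error message varies between Python versions.

```python
# First sample row from the question, assuming tab-separated columns.
line = "locus_1\tmark1\t0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0\n"

cols = line.strip().split("\t")
# Splitting on tabs yields only three fields; all ten values stay in one string.
print(len(cols))
print(cols[2:])

error = None
try:
    float(cols[2])  # the whole comma-separated string is passed to float()
except ValueError as exc:
    error = str(exc)
    print("ValueError:", error)
```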

You should replace:

 data = cols[2:]

with

 data = cols[2].split(',')
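Putting the fix together, the whole task could look roughly like the sketch below. It is written for Python 3 (the question uses Python 2), and the function names `trim_values` and `trim_file` are made up for illustration. It assumes tab-separated columns, a header line to skip (mirroring the `inFile.next()` call in the question), and comma-joined values in the output as in the desired matrix. Since the values are only trimmed, not computed on, no conversion to float is needed.

```python
import gzip


def trim_values(line, keep=5):
    """Return the row with only the first `keep` comma-separated values."""
    cols = line.rstrip("\n").split("\t")
    # The third column holds every value as one comma-separated string.
    values = cols[2].replace("nan", "0").split(",")
    return "\t".join(cols[:2]) + "\t" + ",".join(values[:keep])


def trim_file(path, keep=5, skip_header=True):
    """Yield trimmed rows from a gzip-compressed, tab-delimited file."""
    with gzip.open(path, "rt") as handle:  # "rt" yields str lines in Python 3
        if skip_header:
            next(handle, None)  # drop the header line, if there is one
        for line in handle:
            if line.strip():  # ignore blank lines
                yield trim_values(line, keep)


# Usage on a single in-memory row from the question:
row = "locus_1\tmark1\t0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0\n"
print(trim_values(row))
```

For the real file, iterating over `trim_file('/home/anish/data.gz')` and printing each row reads the data line by line, so the whole matrix never has to fit in memory.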
