
I have a large matrix in a gzip file that looks something like this:

locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0,0.5536,0.9177,0.2929,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0,0.9536,0.8177,0.2827,0.0,0.0

So, each row starts with two descriptors, followed by 10 values.

I simply want to extract the first 5 values of each row, such that I have a matrix like this:

locus_1 mark1 0.0,0.0,0.0,0.0,0.0
locus_2 mark2 0.0,0.0,0.0,0.0,0.0
locus_3 mark2 0.0,0.0,0.1,0.0,0.0

I wrote the following Python script to parse this, but it fails:

import gzip
import numpy as np

inFile = gzip.open('/home/anish/data.gz')

inFile.next()

for line in inFile:
        cols = line.strip().replace('nan','0').split('\t')
        data = cols[2:]
        data = map(float,data)

        gfpVals =  data[:5]

        print '\t'.join(cols[:6]) + '\t' + '\t'.join(map(str,gfpVals))

I simply get the error:

data = map(float,data)
ValueError: could not convert string to float: 
mechanical_meat
Alex Trevylan
  • Apologies, that's not the error. The error is: data = map(float,data) ValueError: invalid literal for float(): – Alex Trevylan Mar 10 '17 at 18:40
  • If you thought my response answered your question I would appreciate it if you accepted it as the answer, so that this question does not keep appearing as unanswered. – Giannis Spiliopoulos Mar 16 '17 at 17:37
  • [What does your step debugger tell you?](http://stackoverflow.com/questions/25385173/what-is-a-debugger-and-how-can-it-help-me-diagnose-problems) –  Mar 24 '17 at 18:31

1 Answer


You are splitting each line only on tabs, but the values themselves are delimited by commas.

As a result

locus_1 mark1 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0

is split to

locus_1 || mark1 || 0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0

and you are passing to float the string

"0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0"

which is an invalid literal.
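The failure can be reproduced on a single row. This is only a sketch: it assumes the fields really are tab-separated, as the script in the question does, and the variable names are illustrative.

```python
# One sample row from the question, with tabs between the three fields.
line = "locus_1\tmark1\t0.0,0.0,0.0,0.0,0.0,0.4536,0.8177,0.4929,0.0,0.0\n"

cols = line.strip().split("\t")
# cols holds three items: two descriptors and ONE comma-joined string.
n_fields = len(cols)

try:
    float(cols[2])
    failed = False
except ValueError:
    # float() cannot parse a string that contains commas.
    failed = True

# Splitting the third field on commas yields the ten individual values.
values = [float(v) for v in cols[2].split(",")]
```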

You should replace:

 data = cols[2:]

with

 data = cols[2].split(',')
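Putting it together, here is a minimal sketch of the whole task, not a definitive implementation. It is written for Python 3, whereas the question uses Python 2. The helper names `first_values` and `parse_file` are hypothetical. It assumes the two descriptors are separated by whitespace (tabs or spaces) and that all values sit in a single comma-separated third field, which is why it indexes that one field before splitting on commas. The values are kept as text so their formatting is unchanged, and `nan` is replaced by `0` as in the question.

```python
import gzip

def first_values(line, n=5):
    """Return the two descriptors and the first n values of one row.

    Hypothetical helper: assumes whitespace-separated descriptors followed
    by a single comma-separated field of values.
    """
    name, mark, values = line.split()
    # Mirror the question's handling of missing values: treat 'nan' as 0.
    kept = [v if v != "nan" else "0" for v in values.split(",")[:n]]
    # Tab-separated output, like the print statement in the question.
    return "\t".join([name, mark, ",".join(kept)])

def parse_file(path, n=5, skip_header=True):
    """Yield trimmed rows from a gzip-compressed matrix file."""
    # 'rt' opens the gzip stream in text mode so lines are str, not bytes.
    with gzip.open(path, "rt") as in_file:
        if skip_header:
            next(in_file, None)  # the question's script skips the first line
        for line in in_file:
            if line.strip():  # ignore blank lines
                yield first_values(line, n)
```

Looping over `parse_file('/home/anish/data.gz')` and printing each item should then give one trimmed row per input line. Note that the original `print` joins `cols[:6]`, which would also need to become the first two columns only.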
Giannis Spiliopoulos