Parsing a tab separated file using python

Question

I have a tab-separated text file which looks like this:

  aaa   0.0520852296    0.1648703511    0.1648703511
  bbb   0.1062639955    0.1632039268    0.1632039268
  ccc   1.4112745088    4.3654577641    4.3654577641
  ddd   0.4992644913    0.1648703511    0.1648703511
  eeee  0.169058175 0.1632039268    0.1632039268

and the output should be

aaa 0.0232736716    0.0328321936    0.0328321936
bbb 0.0474828153    0.0325003428    0.0325003428
ccc 0.6306113983    0.8693349271    0.8693349271
ddd 0.2230904597    0.0328321936    0.0328321936
eeee    0.0755416551    0.0325003428    0.0325003428

That is, each value divided by the total sum of its column.

The .txt file has many more rows and columns like these. I need to find the sum of each column, from the 2nd column to the last, then divide every numeric entry by the sum of its column and print the result as the output. So far I have got as far as strip and split, and from there I am not able to select the numeric values.

import numpy as np
motif_path = '/home/test/test.txt'
f = open(motif_path, 'r')
x = f.readlines()
kk = [s.strip().split("\t") for s in x]

When I tried `for i in Kk[1][1]`, I received an error:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Obviously, the exact wording of that error is relevant, as well as the line it refers to. Add that to your question! – Marcus Müller Jun 26 '15 at 07:33
Please also add your complete `for` loop, not just the first line with a comment. – cdarke Jun 26 '15 at 07:34
Also, are you sure that you did not make a copy-and-paste mistake? `kk` is not the same as `Kk`! – Marcus Müller Jun 26 '15 at 07:35
Why did you add the `numpy` tag? Did you want a solution in `numpy`? – Mazdak Jun 26 '15 at 07:35
Possible duplicate of [How to read csv into record array in numpy?](http://stackoverflow.com/questions/3518778/how-to-read-csv-into-record-array-in-numpy) – dting Jun 26 '15 at 07:38

score 1 · Answer 1 · answered Jun 26 '15 at 08:02

Why don't you use Python's csv module and change the delimiter from a comma to a space?

import csv
motif_path = '/home/test/test.txt'
# In Python 3, open the file in text mode with newline='' for the csv module.
with open(motif_path, newline='') as csvfile:
    data = csv.reader(csvfile, delimiter=' ')
    for dI in data:
        print(dI)

Output

['Aaa', '0.4567', '0.6780']
['Bibb', '0.6783', '0.235']
['Cccc', '0.4567', '0.4567']

jhoepken


Thank you, Jen, for the reply, but my file is a tab-separated file: aaa 0.0520852296 0.1648703511 0.1648703511 bbb 0.1062639955 0.1632039268 0.1632039268 ccc 1.4112745088 4.3654577641 4.3654577641 ddd 0.4992644913 0.1648703511 0.1648703511 eeee 0.169058175 0.1632039268 0.1632039268 and the output I need is each value in a column divided by sum(column) – ARJ Jun 26 '15 at 08:06
Then replace the space with `\t` and **please** specify requirements like this in the question. Because otherwise nobody will give you a perfectly suitable answer. – jhoepken Jun 26 '15 at 08:09
And please change the title. And your syntax highlighting. – jhoepken Jun 26 '15 at 08:18
Change the title to what? – ARJ Jun 26 '15 at 08:19
To something that describes your question. Since you desire to get an answer to something else than *python array with string*, this is the only way to get an answer. – jhoepken Jun 26 '15 at 08:20
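
As a rough sketch of the suggestion in the comments above (replace the space with `\t`), the snippet below reads tab-separated rows with `csv.reader`, converts the numeric fields to floats, and divides each value by the sum of its column. The function name `normalize_columns` and the in-memory sample are illustrative assumptions, not part of the original answer; with a real file you would pass `open(motif_path, newline='')` instead of the `io.StringIO` object.

```python
import csv
import io


def normalize_columns(lines):
    """Divide each numeric entry by the sum of its column.

    `lines` is any iterable of tab-separated text lines whose first field
    is a label and whose remaining fields are numbers (an assumption based
    on the sample data in the question).
    """
    rows = [row for row in csv.reader(lines, delimiter='\t') if row]
    labels = [row[0] for row in rows]
    # csv.reader yields strings, so convert the numeric fields explicitly.
    values = [[float(cell) for cell in row[1:]] for row in rows]
    # zip(*values) transposes the rows, giving one tuple per column.
    col_sums = [sum(col) for col in zip(*values)]
    return [[label] + [v / s for v, s in zip(vals, col_sums)]
            for label, vals in zip(labels, values)]


# Sample data from the question, held in memory instead of a file.
sample = io.StringIO(
    "aaa\t0.0520852296\t0.1648703511\t0.1648703511\n"
    "bbb\t0.1062639955\t0.1632039268\t0.1632039268\n"
    "ccc\t1.4112745088\t4.3654577641\t4.3654577641\n"
    "ddd\t0.4992644913\t0.1648703511\t0.1648703511\n"
    "eeee\t0.169058175\t0.1632039268\t0.1632039268\n"
)

for row in normalize_columns(sample):
    print('\t'.join([row[0]] + ['%.10f' % v for v in row[1:]]))
```

The explicit `float(...)` conversion is the important step: every field that comes out of `csv.reader` (or `str.split`) is a string, and adding strings to numbers is what produces the `TypeError` quoted in the question.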

Cleb · Accepted Answer · 2015-06-26T09:25:34.063

I saw the "numpy" tag, but you might consider Python's "pandas" library as an alternative, which gives the desired output in only a few lines; with it you can easily divide each entry by the sum of its column (or row).

First you read the file into a data frame, and then you perform the desired operation on the three numeric columns of this data frame. If you wish, you can then easily write the data frame back to a .txt file (the output is shown below). Let me know whether that meets your needs and whether you have any questions about this code.

Here is the code:

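import pandas as pd

# The file has four columns but only three names are given, so pandas
# uses the first column (aaa, bbb, ...) as the row index.
df = pd.read_csv('myData.txt', sep='\t', header=None, names=['val1', 'val2', 'val3'])
print(df)
# Divide each entry by the sum of its column.
df.loc[:, "val1":"val3"] = df.loc[:, "val1":"val3"].div(df.sum(axis=0), axis=1)
print(df)
df.to_csv('output.txt', header=False, sep='\t', encoding='utf-8')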

And the output of this script would be:

          val1      val2      val3
aaa   0.052085  0.164870  0.164870
bbb   0.106264  0.163204  0.163204
ccc   1.411275  4.365458  4.365458
ddd   0.499264  0.164870  0.164870
eeee  0.169058  0.163204  0.163204

          val1      val2      val3
aaa   0.023274  0.032832  0.032832
bbb   0.047483  0.032500  0.032500
ccc   0.630611  0.869335  0.869335
ddd   0.223090  0.032832  0.032832
eeee  0.075542  0.032500  0.032500

and the file "output.txt" looks like this:

aaa 0.0232736716104 0.0328321936442 0.0328321936442
bbb 0.0474828152678 0.0325003427993 0.0325003427993
ccc 0.630611398322  0.869334927113  0.869334927113
ddd 0.223090459743  0.0328321936442 0.0328321936442
eeee    0.075541655057  0.0325003427993 0.0325003427993
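
Since the question carried the "numpy" tag, here is a minimal NumPy-only sketch of the same normalization, offered as an alternative and not as part of the original answer. It assumes the first column holds labels and all remaining columns are numeric; the function name `normalize_table` and the in-memory sample are hypothetical, and a file path could be passed to `np.loadtxt` in place of the `io.StringIO` object.

```python
import io

import numpy as np


def normalize_table(source):
    """Return (labels, values divided by their column sums)."""
    # dtype=str keeps labels and numbers together in one array of strings;
    # ndmin=2 keeps the result two-dimensional even for a one-line file.
    table = np.loadtxt(source, delimiter='\t', dtype=str, ndmin=2)
    labels = table[:, 0]
    values = table[:, 1:].astype(float)
    # values.sum(axis=0) has one entry per column; broadcasting divides
    # every row by that vector element-wise.
    return labels, values / values.sum(axis=0)


# Sample data from the question, held in memory instead of a file.
sample = io.StringIO(
    "aaa\t0.0520852296\t0.1648703511\t0.1648703511\n"
    "bbb\t0.1062639955\t0.1632039268\t0.1632039268\n"
    "ccc\t1.4112745088\t4.3654577641\t4.3654577641\n"
    "ddd\t0.4992644913\t0.1648703511\t0.1648703511\n"
    "eeee\t0.169058175\t0.1632039268\t0.1632039268\n"
)

labels, normalized = normalize_table(sample)
for label, row in zip(labels, normalized):
    print('\t'.join([label] + ['%.10f' % v for v in row]))
```

After normalization each column sums to 1, which is a quick way to check the result.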

I am afraid I can't see a check mark next to your answer. Also, I am not able to vote on your answer since I am a new user, I am sorry. :( – ARJ Jun 26 '15 at 14:05
In addition, I have to calculate the entropy of the output data, so I applied this formula to the output: entropy = - sum([ p * math.log(p) / math.log(2.0) for p in df ]) but it throws the error NameError: name 'math' is not defined. May I know how I can get this done? – ARJ Jun 26 '15 at 14:07
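
Regarding the entropy question in the comment above: the `NameError` means the `math` module was never imported. Separately, iterating over a DataFrame with `for p in df` yields the column labels, not the values, so the formula would still fail after adding the import. The sketch below assumes that one Shannon entropy value (in bits) per column is wanted, which the comment does not state; the function name `shannon_entropy` is illustrative, and the sample list is the first normalized column from output.txt.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in bits of an iterable of probabilities.

    Zero entries are skipped because log(0) is undefined; by convention
    they contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# First normalized column (val1) taken from output.txt above.
column = [0.0232736716104, 0.0474828152678, 0.630611398322,
          0.223090459743, 0.075541655057]

print(shannon_entropy(column))
```

With the normalized DataFrame from the answer, `df.apply(shannon_entropy)` would likely give one entropy value per column, since `apply` passes each column to the function by default.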

user3636636 · Answer 3 · 2015-06-26T07:51:01.420

From the information you have provided, kk will be [['Aaa 0.4567 0.6780'], ['Bibb 0.6783. 0.235'], ['Cccc 0.4567. 0.4567'], ['']]

which means `kk[1][1]` will be out of bounds. What was your expected output? I might be able to help further.

Yes, that is the problem. I need the numbers alone so that I can find the sum of each column. For ['Aaa 0.4567 0.6780'] I need just the numeric part, and the same for all rows, so that I can then find the column sums. – ARJ Jun 26 '15 at 07:50
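
To illustrate the point made in this answer, here is a small sketch of why the split produced one-element lists and how the numeric part could be extracted. It assumes the fields are separated by whitespace (spaces or tabs); the helper name `numeric_fields` is hypothetical.

```python
line_with_spaces = "Aaa 0.4567 0.6780"
line_with_tabs = "aaa\t0.0520852296\t0.1648703511\t0.1648703511"

# split("\t") only splits on tab characters, so a space-separated line
# comes back as a single-element list and index 1 does not exist.
print(line_with_spaces.split("\t"))  # ['Aaa 0.4567 0.6780']

# split() with no argument splits on any run of whitespace,
# so it handles both spaces and tabs.
print(line_with_spaces.split())      # ['Aaa', '0.4567', '0.6780']


def numeric_fields(line):
    """Return every field after the leading label, converted to float."""
    return [float(field) for field in line.split()[1:]]


print(numeric_fields(line_with_tabs))
```

Converting with `float` also avoids the `TypeError` from the question: calling `sum` on a list of strings starts from the integer 0 and then fails when it tries to add a string to it.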
