Parse a tab and comma separated file

Question

I have a table containing several thousand lines like this:

A   GO:0008150,GO:0050789,GO:0050794,GO:0051726,GO:0065007
B   GO:0008150,GO:0050789,GO:0050794,GO:0051726,GO:0065007

I want to reshape my table into the following format, with one term per line:

A   GO:0008150
A   GO:0050789
A   GO:0050794
A   GO:0051726
A   GO:0065007
B   GO:0008150
B   GO:0050789
B   GO:0050794
B   GO:0051726
B   GO:0065007

Any help will be greatly appreciated. Thanks

"I want to parse my table in the following format." What have you tried? Good luck. — shellter, Aug 19 '16 at 22:04

score 1 · Accepted Answer · edited May 23 '17 at 12:08


Easy with awk: just split() the second column and loop through the slices:

$ awk '{n=split($2, a, ","); for (i=1;i<=n;i++) print $1,a[i]}' file
A GO:0008150
A GO:0050789
A GO:0050794
A GO:0051726
A GO:0065007
B GO:0008150
B GO:0050789
B GO:0050794
B GO:0051726
B GO:0065007

answered Aug 19 '16 at 22:04 by fedorqui · edited May 23 '17 at 12:08 by Community

Thanks. Can you explain what 'a' is in the first section of code? – pali Aug 20 '16 at 20:14
@pali as seen in the link I provide, it is the array in which the slices are stored. – fedorqui Aug 21 '16 at 10:04
Thanks for information – pali Aug 21 '16 at 12:24
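For readers with the same question as pali: in the accepted answer, split($2, a, ",") fills the array a with the comma-separated pieces of the second field and returns how many pieces there are, which the one-liner stores in n. A rough Python analogue is sketched below. It is an illustration only, not part of the original answer; the sample line is shortened, and note that awk numbers the elements from 1 while Python lists start at 0.

```python
# Illustrative analogue of: n = split($2, a, ","); for (...) print $1, a[i]
line = "A   GO:0008150,GO:0050789,GO:0050794"

key, second = line.split()   # roughly awk's $1 and $2 (whitespace-separated)
a = second.split(",")        # the "slices", like awk's array a
n = len(a)                   # awk's split() returns this count

rows = [f"{key} {a[i]}" for i in range(n)]
print("\n".join(rows))
```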

score 1 · Answer 2 · answered Aug 20 '16 at 01:09


awk without loops; requires an awk that accepts a multi-character (regex) RS, such as GNU awk.

$ awk -v RS=",|\n" 'NF==2{t=$1;$1=$2} {print t,$1}' file
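The idea behind this one-liner is to treat both commas and newlines as record separators, so every GO term becomes its own record; a record with two fields starts a new key, which is remembered for the records that follow. The Python sketch below mimics that logic to make it easier to follow. The function name explode and the shortened sample text are hypothetical, and unlike the awk command it skips empty records instead of printing them.

```python
import re

def explode(text):
    """Emulate RS=",|\\n": one record per comma- or newline-delimited chunk."""
    rows = []
    key = None
    for record in re.split(r",|\n", text):
        fields = record.split()
        if not fields:
            continue              # empty record, e.g. after a trailing newline
        if len(fields) == 2:      # like NF==2: this record carries a new key
            key, value = fields
        else:                     # continuation record: reuse the saved key
            value = fields[0]
        rows.append(f"{key} {value}")
    return rows

sample = "A\tGO:0008150,GO:0050789\nB\tGO:0050794,GO:0051726\n"
print("\n".join(explode(sample)))
```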

answered Aug 20 '16 at 01:09 by karakfa

score 0 · Answer 3 · edited Aug 20 '16 at 18:27

You can use Python with the re module.

import re

text = '''A   GO:0008150,GO:0050789,GO:0050794,GO:0051726,GO:0065007
B   GO:0008150,GO:0050789,GO:0050794,GO:0051726,GO:0065007'''

# One pattern per key; the group captures the comma-separated GO terms.
pattern = {
    'A': re.compile(r'A\s+(GO.*)\n'),
    'B': re.compile(r'B\s+(GO.*)\n*'),
}

# Split the captured list on commas and prefix every term with its key.
A = 'A  ' + '\nA  '.join(pattern['A'].findall(text)[0].split(','))
B = 'B  ' + '\nB  '.join(pattern['B'].findall(text)[0].split(','))
print(A)
print(B)

Output:

A  GO:0008150
A  GO:0050789
A  GO:0050794
A  GO:0051726
A  GO:0065007
B  GO:0008150
B  GO:0050789
B  GO:0050794
B  GO:0051726
B  GO:0065007
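The regex version above needs one hand-written pattern per key, which does not scale to a table with thousands of different keys. A minimal sketch of a more general variant follows, assuming each line holds a key, some whitespace, and then a comma-separated list. The function name expand and the sample data are hypothetical; for a real file, an open file object could be passed in place of the list.

```python
def expand(lines):
    """Yield one 'key  term' string per comma-separated term on each line."""
    for line in lines:
        parts = line.split(None, 1)   # key, then the rest of the line
        if len(parts) != 2:
            continue                  # skip blank or malformed lines
        key, terms = parts
        for term in terms.strip().split(","):
            yield f"{key}  {term}"

sample = [
    "A   GO:0008150,GO:0050789\n",
    "\n",
    "B\tGO:0050794\n",
]
for row in expand(sample):
    print(row)
```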
