I have a text file in this rather horrendous HTML format:
A<b>Metabolism</b>
B
B <b>Overview</b>
C 01200 Carbon metabolism [PATH:bpe01200]
D BP3142 pgi; glucose-6-phosphate isomerase K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D BP1971 pgi; glucose-6-phosphate isomerase K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
D BP1519 fba; fructose-1,6-bisphosphate aldolase K01624 FBA; fructose-bisphosphate aldolase, class II [EC:4.1.2.13]
D BP0801 tpiA; triosephosphate isomerase K01803 TPI; triosephosphate isomerase (TIM) [EC:5.3.1.1]
D BP1000 gap; glyceraldehyde-3-phosphate dehydrogenase K00134 GAPDH; glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12]
I would like to parse this file into columns in R, such as:
A,Metabolism
B,
B,Overview
C,01200,Carbon metabolism,Path,bpe01200
D,BP3142,pgi,glucose-6-phosphate isomerase,K01810,GPI,glucose-6-phosphate isomerase,[EC:5.3.1.9]
...
D,BP1000,gap,glyceraldehyde-3-phosphate dehydrogenase,K00134,GAPDH,glyceraldehyde 3-phosphate dehydrogenase,[EC:1.2.1.12]
The problem is that the delimiter changes in each part of the line. It seems to follow this pattern, e.g.:
D BP1971 pgi; glucose-6-phosphate isomerase K01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]
^Tab ^space^Semi colon ^tab ^space^semi colon
I can think of a not-so-smart way to do it: parsing one delimiter at a time. But does anyone have a smarter solution, or know of a tool that can interpret this format nicely?
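For illustration, here is a minimal sketch of one possible alternative to splitting on one delimiter at a time: a single regular expression per line type, with one capture group per column. The sample lines, the helper name `parse_line`, and the exact whitespace (runs of spaces after the level letter, a tab before the K number) are assumptions for the sake of the example, not taken from the real file.

```r
# Hypothetical sample lines; the real file would be read with readLines().
# The whitespace layout here (spaces and one tab) is an assumption.
lines <- c(
  "A<b>Metabolism</b>",
  "B",
  "B  <b>Overview</b>",
  "C    01200 Carbon metabolism [PATH:bpe01200]",
  "D      BP3142 pgi; glucose-6-phosphate isomerase\tK01810 GPI; glucose-6-phosphate isomerase [EC:5.3.1.9]"
)

# parse_line is a hypothetical helper: it returns the fields of one line
# as a character vector, starting with the level letter (A, B, C or D).
parse_line <- function(x) {
  x <- gsub("<[^>]+>", "", x)   # drop the <b>...</b> tags
  level <- substr(x, 1, 1)

  # One pattern per structured line type; each (...) group is one column.
  pattern <- switch(level,
    C = "^C\\s+(\\d+)\\s+(.*?)\\s*\\[(\\w+):([^]]+)\\]\\s*$",
    D = "^D\\s+(\\S+)\\s+([^;]+);\\s*(.*?)\\s+(K\\d{5})\\s+([^;]+);\\s*(.*?)\\s*(\\[EC:[^]]+\\])?\\s*$",
    NULL
  )

  if (!is.null(pattern)) {
    m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
    if (length(m) > 0) return(c(level, m[-1]))   # m[1] is the full match
  }

  # A and B lines (or anything unmatched): level letter plus remaining text.
  c(level, trimws(substring(x, 2)))
}

rows <- lapply(lines, parse_line)
csv  <- vapply(rows, function(r) paste(r, collapse = ","), character(1))
print(csv)
```

Because `\\s+` matches both spaces and tabs, the pattern does not need to know which whitespace character separates the fields. Note that some descriptions contain commas themselves (e.g. "fructose-bisphosphate aldolase, class II"), so for a real CSV it would probably be safer to put the fields into a data frame and let `write.csv` quote them than to paste them together as above.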
I would really appreciate some help :)
Thanks