I have text output from a program in a fixed format. I need to parse ~200 such outputs to extract some information. I tried MATLAB's 'textscan', but it did not work. The input looks like this:
MOTIFS SUMMARY:
1) TTATAGCCGC (GCGGCTATAA) 1.986
2) AAACCGCCTC (GAGGCGGTTT) 1.865
DETAILED RESULTS:
1) TTATAGCCGC (GCGGCTATAA) 1.986
Matrix: MAT1 TTATAGCCGC
A 0.1249 0.177 0.7364 0.1189 0.7072 0.1149 0.09858 0.1096
C 0.0899 0.07379 0.1136 0.1298 0.08662 0.1293 0.7528 0.721
G 0.06828 0.1284 0.07195 0.1031 0.1352 0.6708 0.05556 0.0713
T 0.7169 0.6209 0.07802 0.6482 0.07096 0.08492 0.09305 0.09804
OCCURRENCES:
>GENE_1 1 TTATAGCCGC 1 561 +
>GENE_2 24 TAATAGCCGC 0.928699 762 -
>GENE_3 10 ATATAGCCGC 0.904905 185 -
>GENE_1 7 TTATAGCAGC 0.901785 726 +
**********
2) AAACCGCCTC (GAGGCGGTTT) 1.865
Matrix: MAT2 AAACCGCCTC
A 0.653 0.7401 0.7763 0.1323 0.09619 0.09134 0.07033 0.1383
C 0.1163 0.07075 0.09441 0.749 0.6347 0.1132 0.6559 0.6982
G 0.09136 0.09402 0.07385 0.04209 0.1799 0.7332 0.1241 0.07568
T 0.1393 0.09518 0.05541 0.07659 0.08921 0.06234 0.1497 0.08786
OCCURRENCES:
>GENE_1 21 AAACCGCCTC 1 963 +
>GENE_2 14 AAACGGCCTC 0.928198 212 +
>GENE_2 8 AAACCGTCTC 0.92009 170 +
>GENE_4 3 TAACCGCCTC 0.918883 370 +
**********
I am trying to count the unique genes listed under OCCURRENCES for each motif, append that count to the motif's line in the MOTIFS SUMMARY, and report the average of the counts at the end. My expected output is:
MOTIFS SUMMARY:
1) TTATAGCCGC (GCGGCTATAA) 1.986 3
2) AAACCGCCTC (GAGGCGGTTT) 1.865 3
AVERAGE OCCURRENCE: 3
For motif 1, the unique occurrence count is 3 (GENE_1, GENE_2, GENE_3). Similarly, for motif 2 it is again 3 (GENE_1, GENE_2, GENE_4).
How can I treat the lines between OCCURRENCES: and the ********** separator as a block, so that I can use regexp to pull out each GENE_x name, store it, and count the unique ones?
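To make the block idea concrete, here is a rough sketch of the logic in Python rather than MATLAB, purely as an illustration. The function names count_unique_genes and format_summary are made up for this example. It assumes every OCCURRENCES block is closed by a line of asterisks and that each occurrence line starts with '>' followed by the gene name, as in the sample above. The SAMPLE string is a shortened copy of the input with the matrix rows left out.

```python
import re

# Shortened copy of the sample input (matrix rows omitted for brevity).
SAMPLE = """MOTIFS SUMMARY:
1) TTATAGCCGC (GCGGCTATAA) 1.986
2) AAACCGCCTC (GAGGCGGTTT) 1.865
DETAILED RESULTS:
1) TTATAGCCGC (GCGGCTATAA) 1.986
Matrix: MAT1 TTATAGCCGC
OCCURRENCES:
>GENE_1 1 TTATAGCCGC 1 561 +
>GENE_2 24 TAATAGCCGC 0.928699 762 -
>GENE_3 10 ATATAGCCGC 0.904905 185 -
>GENE_1 7 TTATAGCAGC 0.901785 726 +
**********
2) AAACCGCCTC (GAGGCGGTTT) 1.865
Matrix: MAT2 AAACCGCCTC
OCCURRENCES:
>GENE_1 21 AAACCGCCTC 1 963 +
>GENE_2 14 AAACGGCCTC 0.928198 212 +
>GENE_2 8 AAACCGTCTC 0.92009 170 +
>GENE_4 3 TAACCGCCTC 0.918883 370 +
**********
"""


def count_unique_genes(text):
    """Return [(motif_header, n_unique_genes), ...], one entry per motif block."""
    results = []
    header = None        # most recent "N) MOTIF (REVCOMP) score" line
    genes = None         # set of gene names while inside an OCCURRENCES block
    in_details = False   # skip everything before DETAILED RESULTS
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("DETAILED RESULTS"):
            in_details = True
        elif not in_details:
            continue
        elif re.match(r"\d+\)\s", line):
            header = line
        elif line.startswith("OCCURRENCES"):
            genes = set()                      # block opens here
        elif line.startswith("*") and genes is not None:
            results.append((header, len(genes)))
            genes = None                       # block closes at the asterisks
        elif genes is not None and line.startswith(">"):
            genes.add(line[1:].split()[0])     # ">GENE_1 1 ..." -> "GENE_1"
    return results


def format_summary(results):
    """Build the summary text with the unique count appended to each motif line."""
    lines = ["MOTIFS SUMMARY:"]
    lines += [f"{header} {n}" for header, n in results]
    average = sum(n for _, n in results) / len(results) if results else 0
    lines.append(f"AVERAGE OCCURRENCE: {average:g}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_summary(count_unique_genes(SAMPLE)))
```

Using a set for each block makes the uniqueness step automatic, which is the role unique() would play in MATLAB. For the ~200 files, the same two functions could be called once per file after reading each file's contents into a string.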
Kindly help.
Thanks,
AP