I have text output from a program in a fixed format. I need to parse about 200 of these files to extract some information. I tried `textscan` in MATLAB, but it did not work. The input looks like this:

MOTIFS SUMMARY:

1)  TTATAGCCGC  (GCGGCTATAA)    1.986
2)  AAACCGCCTC  (GAGGCGGTTT)    1.865

DETAILED RESULTS:

1)  TTATAGCCGC  (GCGGCTATAA)    1.986

Matrix: MAT1    TTATAGCCGC
A   0.1249  0.177   0.7364  0.1189  0.7072  0.1149  0.09858 0.1096
C   0.0899  0.07379 0.1136  0.1298  0.08662 0.1293  0.7528  0.721
G   0.06828 0.1284  0.07195 0.1031  0.1352  0.6708  0.05556 0.0713
T   0.7169  0.6209  0.07802 0.6482  0.07096 0.08492 0.09305 0.09804

OCCURRENCES:
>GENE_1  1  TTATAGCCGC  1   561 +
>GENE_2  24 TAATAGCCGC  0.928699    762 -
>GENE_3  10 ATATAGCCGC  0.904905    185 -
>GENE_1  7  TTATAGCAGC  0.901785    726 +
**********

2)  AAACCGCCTC  (GAGGCGGTTT)    1.865

Matrix: MAT2    AAACCGCCTC
A   0.653   0.7401  0.7763  0.1323  0.09619 0.09134 0.07033 0.1383  
C   0.1163  0.07075 0.09441 0.749   0.6347  0.1132  0.6559  0.6982
G   0.09136 0.09402 0.07385 0.04209 0.1799  0.7332  0.1241  0.07568
T   0.1393  0.09518 0.05541 0.07659 0.08921 0.06234 0.1497  0.08786

OCCURRENCES:
>GENE_1  21 AAACCGCCTC  1   963 +
>GENE_2  14 AAACGGCCTC  0.928198    212 +
>GENE_2  8  AAACCGTCTC  0.92009 170 +
>GENE_4  3  TAACCGCCTC  0.918883    370 +
**********

I am trying to count the unique genes listed under OCCURRENCES for each motif, append that count to the corresponding line of the MOTIFS SUMMARY, and print the average of the counts at the end. My expected output is:

MOTIFS SUMMARY:

    1)  TTATAGCCGC  (GCGGCTATAA)    1.986   3
    2)  AAACCGCCTC  (GAGGCGGTTT)    1.865   3
AVERAGE OCCURRENCE: 3

For motif 1, the unique occurrence count is 3 (GENE_1, GENE_2, GENE_3). Similarly, for motif 2 it is also 3 (GENE_1, GENE_2, GENE_4).

How can I treat everything between OCCURRENCES: and the row of asterisks as one block, so that I can use regexp to collect the GENE_x names in it and count them?

Kindly help.

Thanks,

AP

Arun
  • What inputs did you use for `textscan` that didn't work? Since it's a relatively complicated format, you'll need to have a decent amount of code to handle it. You can't expect `textscan` to just magically be able to understand all of it. – Suever Jan 28 '17 at 21:30
  • Is simplifying the text files an option? It seems awfully complicated. Anyhow, I suggest using `textscan()` to get information in a convenient format (e.g. in a structure) and **then** processing the extracted information. – Mohammadreza Khoshbin Jan 29 '17 at 15:25
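
The "extract first, then process" idea from the comments can be sketched with `regexp` alone. This is only a sketch under assumptions: the sample text is a shortened copy of the input in the question (matrices omitted), the variable names are made up for illustration, and every OCCURRENCES block is assumed to end with a row of at least five asterisks. For a real file, `txt` would come from `fileread` instead.

```matlab
% Shortened sample in the question's format; for a real file use
% txt = fileread(filename) instead.
txt = strjoin({ ...
    'MOTIFS SUMMARY:', '', ...
    '1)  TTATAGCCGC  (GCGGCTATAA)    1.986', ...
    '2)  AAACCGCCTC  (GAGGCGGTTT)    1.865', '', ...
    'DETAILED RESULTS:', '', ...
    '1)  TTATAGCCGC  (GCGGCTATAA)    1.986', '', ...
    'OCCURRENCES:', ...
    '>GENE_1  1  TTATAGCCGC  1   561 +', ...
    '>GENE_2  24 TAATAGCCGC  0.928699    762 -', ...
    '>GENE_3  10 ATATAGCCGC  0.904905    185 -', ...
    '>GENE_1  7  TTATAGCAGC  0.901785    726 +', ...
    '**********', '', ...
    '2)  AAACCGCCTC  (GAGGCGGTTT)    1.865', '', ...
    'OCCURRENCES:', ...
    '>GENE_1  21 AAACCGCCTC  1   963 +', ...
    '>GENE_2  14 AAACGGCCTC  0.928198    212 +', ...
    '>GENE_2  8  AAACCGTCTC  0.92009 170 +', ...
    '>GENE_4  3  TAACCGCCTC  0.918883    370 +', ...
    '**********'}, char(10));

% One token per block: everything between 'OCCURRENCES:' and the asterisks.
% In MATLAB regexp, '.' also matches newlines by default.
blocks = regexp(txt, 'OCCURRENCES:(.*?)\*{5,}', 'tokens');

% Count the distinct gene names (the word right after '>') in each block.
counts = zeros(1, numel(blocks));
for k = 1:numel(blocks)
    genes = regexp(blocks{k}{1}, '(?<=>)\S+', 'match');
    counts(k) = numel(unique(genes));
end

% Motif header lines appear in the summary first, then again in the details;
% keep only the first numel(blocks) of them (the summary copies).
headers = regexp(txt, '^\d+\)\s+\S+\s+\(\S+\)\s+[\d.]+', 'match', 'lineanchors');
headers = headers(1:numel(blocks));

fprintf('MOTIFS SUMMARY:\n\n');
for k = 1:numel(blocks)
    fprintf('    %s   %d\n', headers{k}, counts(k));
end
fprintf('AVERAGE OCCURRENCE: %g\n', mean(counts));
```

Wrapping this in a loop over the ~200 files should only require replacing the sample `txt` with the contents of each file.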

1 Answer

You could try rewriting the original text file so that it becomes legal MATLAB code (an m-file), then run it with the `eval` function. Most of the work is finding where to insert `=`, `[` and `]`, and where to put `%` so that the parts you do not need are ignored. If all files have an identical format, this will be easy.
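
As a rough illustration of this idea on the matrix part only: the sketch below takes a few matrix rows (first three columns of MAT1 from the question), strips the leading base letter, inserts `=`, `[`, `]` and `;`, and evaluates the result. The variable names `block`, `numeric` and `code` are made up for the example. Keep in mind that `eval` executes whatever text it is given, so this is only reasonable for files you trust.

```matlab
% A few matrix rows copied from the question (first three columns of MAT1).
block = sprintf('%s\n', ...
    'A   0.1249  0.177   0.7364', ...
    'C   0.0899  0.07379 0.1136', ...
    'G   0.06828 0.1284  0.07195', ...
    'T   0.7169  0.6209  0.07802');

% Drop the leading base letter on every line, leaving only the numbers.
numeric = regexprep(block, '^[ACGT]\s+', '', 'lineanchors');

% Turn line breaks into row separators and wrap the text in an assignment.
code = ['MAT1 = [' strrep(strtrim(numeric), char(10), ';') '];'];

% Run the generated statement; MAT1 is now a 4-by-3 numeric matrix.
eval(code);
disp(size(MAT1));
```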

Mendi Barel