Text file processing in Matlab

Question

I have text output from a program in a fixed format. I need to parse ~200 of these files to extract some information. I tried `textscan` in MATLAB, but it did not work. The input looks like this:

MOTIFS SUMMARY:

1)  TTATAGCCGC  (GCGGCTATAA)    1.986
2)  AAACCGCCTC  (GAGGCGGTTT)    1.865

DETAILED RESULTS:

1)  TTATAGCCGC  (GCGGCTATAA)    1.986

Matrix: MAT1    TTATAGCCGC
A   0.1249  0.177   0.7364  0.1189  0.7072  0.1149  0.09858 0.1096
C   0.0899  0.07379 0.1136  0.1298  0.08662 0.1293  0.7528  0.721
G   0.06828 0.1284  0.07195 0.1031  0.1352  0.6708  0.05556 0.0713
T   0.7169  0.6209  0.07802 0.6482  0.07096 0.08492 0.09305 0.09804

OCCURRENCES:
>GENE_1  1  TTATAGCCGC  1   561 +
>GENE_2  24 TAATAGCCGC  0.928699    762 -
>GENE_3  10 ATATAGCCGC  0.904905    185 -
>GENE_1  7  TTATAGCAGC  0.901785    726 +
**********

2)  AAACCGCCTC  (GAGGCGGTTT)    1.865

Matrix: MAT2    AAACCGCCTC
A   0.653   0.7401  0.7763  0.1323  0.09619 0.09134 0.07033 0.1383  
C   0.1163  0.07075 0.09441 0.749   0.6347  0.1132  0.6559  0.6982
G   0.09136 0.09402 0.07385 0.04209 0.1799  0.7332  0.1241  0.07568
T   0.1393  0.09518 0.05541 0.07659 0.08921 0.06234 0.1497  0.08786

OCCURRENCES:
>GENE_1  21 AAACCGCCTC  1   963 +
>GENE_2  14 AAACGGCCTC  0.928198    212 +
>GENE_2  8  AAACCGTCTC  0.92009 170 +
>GENE_4  3  TAACCGCCTC  0.918883    370 +
**********

I am trying to count the unique() occurrences under each motif, append that count to the MOTIFS SUMMARY, and add a final average of the counts. My expected output is:

MOTIFS SUMMARY:

    1)  TTATAGCCGC  (GCGGCTATAA)    1.986   3
    2)  AAACCGCCTC  (GAGGCGGTTT)    1.865   3
AVERAGE OCCURRENCE: 3

For motif 1, the number of unique occurrences is 3 (GENE_1, GENE_2, GENE_3). Similarly, for motif 2 it is again 3 (GENE_1, GENE_2, GENE_4).

How can I treat the text between OCCURRENCES and ********** as a block, so that I can use regexp to extract each GENE_x, store it, and count?

Kindly help.

Thanks,

AP

What inputs did you use for `textscan` that didn't work? Since it's a relatively complicated format, you'll need to have a decent amount of code to handle it. You can't expect `textscan` to just magically be able to understand all of it. — Suever, Jan 28 '17 at 21:30
Is simplifying the text files an option? It seems awfully complicated. Anyhow, I suggest using `textscan()` to get information in a convenient format (e.g. in a structure) and **then** processing the extracted information. — Mohammadreza Khoshbin, Jan 29 '17 at 15:25
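
The block idea from the question (treating everything between `OCCURRENCES:` and the row of asterisks as one unit) can be sketched with `regexp` alone. This is a minimal sketch, not a definitive implementation. It assumes the format is exactly as posted, that there is one OCCURRENCES block per summary line and in the same order, and that gene names contain no spaces. The variable names and the commented-out file name are hypothetical.

```matlab
% Sample text in the question's format (matrix rows omitted for brevity).
% For a real file, something like txt = fileread('result1.txt') would
% replace this; the file name is hypothetical.
txt = sprintf([ ...
    'MOTIFS SUMMARY:\n\n' ...
    '1)  TTATAGCCGC  (GCGGCTATAA)    1.986\n' ...
    '2)  AAACCGCCTC  (GAGGCGGTTT)    1.865\n\n' ...
    'DETAILED RESULTS:\n\n' ...
    '1)  TTATAGCCGC  (GCGGCTATAA)    1.986\n\n' ...
    'OCCURRENCES:\n' ...
    '>GENE_1  1  TTATAGCCGC  1   561 +\n' ...
    '>GENE_2  24 TAATAGCCGC  0.928699    762 -\n' ...
    '>GENE_3  10 ATATAGCCGC  0.904905    185 -\n' ...
    '>GENE_1  7  TTATAGCAGC  0.901785    726 +\n' ...
    '**********\n\n' ...
    '2)  AAACCGCCTC  (GAGGCGGTTT)    1.865\n\n' ...
    'OCCURRENCES:\n' ...
    '>GENE_1  21 AAACCGCCTC  1   963 +\n' ...
    '>GENE_2  14 AAACGGCCTC  0.928198    212 +\n' ...
    '>GENE_2  8  AAACCGTCTC  0.92009 170 +\n' ...
    '>GENE_4  3  TAACCGCCTC  0.918883    370 +\n' ...
    '**********\n']);

% Each block runs from "OCCURRENCES:" to the row of asterisks. In MATLAB
% regular expressions "." also matches newlines by default, so the lazy
% ".*?" spans the lines of one block and stops at the first asterisk row.
blocks = regexp(txt, 'OCCURRENCES:(.*?)\*{5,}', 'tokens');

counts = zeros(1, numel(blocks));
for k = 1:numel(blocks)
    % The gene name is the run of non-space characters after each ">".
    genes = regexp(blocks{k}{1}, '>(\S+)', 'tokens');
    genes = cellfun(@(c) c{1}, genes, 'UniformOutput', false);
    counts(k) = numel(unique(genes));
end

% The summary lines sit between the two section headers.
summaryText = regexp(txt, 'MOTIFS SUMMARY:(.*?)DETAILED RESULTS:', ...
    'tokens', 'once');
summaryLines = regexp(summaryText{1}, '^\d+\).*$', 'match', ...
    'lineanchors', 'dotexceptnewline');

% Assumption: summary line k corresponds to OCCURRENCES block k.
fprintf('MOTIFS SUMMARY:\n\n');
for k = 1:numel(summaryLines)
    fprintf('%s\t%d\n', strtrim(summaryLines{k}), counts(k));
end
fprintf('AVERAGE OCCURRENCE: %g\n', mean(counts));
```

For the ~200 files, the same steps could be wrapped in a loop over the result of `dir`, reading each file with `fileread`. Because the text is only searched and never executed, this approach does not rely on `eval`.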

score 0 · Answer 1 · answered Jan 29 '17 at 01:20

It may be better to convert the original text file into legal MATLAB m-file code and then run it with the 'eval' function. Most of the work will be finding where to insert '=', '[' and ']', and '%' to comment out the parts to ignore. If all the files are identical in format, this will be easy.

Mendi Barel


Using `eval()` is a serious design flaw as it exposes the program to possibly malicious code. – Mohammadreza Khoshbin Jan 29 '17 at 15:21
Using a computer in general exposes you to malicious code, including code that you may write yourself. – Mendi Barel Jan 30 '17 at 03:22
See these links: [link1](http://stackoverflow.com/questions/10272522/use-and-implications-of-evalexpression-in-matlab-code) and [link2](http://stackoverflow.com/questions/1832940/is-using-eval-in-python-a-bad-practice) – Mohammadreza Khoshbin Jan 30 '17 at 03:26
I don't need examples. It's very easy to write script code that will delete your hard drive. A simple example that deletes the pwd: " a=dir; arrayfun(@(f)delete(f.name), a); " – Mendi Barel Jan 30 '17 at 04:17
And if this line is in the file that you run `eval()` on, you will be in trouble. – Mohammadreza Khoshbin Jan 30 '17 at 06:54
To prove my point, I just deleted all my pwd without using eval() at all. Just copy this line to new script and press F5 :) – Mendi Barel Jan 30 '17 at 07:24
