I'm trying to analyze a large text log file (11 GB). All of the data are numerical values, and a snippet of the data is shown below.
-0.0623 0.0524 -0.0658 -0.0015 0.0136 -0.0063 0.0259 -0.003
-0.0028 0.0403 0.0009 -0.0016 -0.0013 -0.0308 0.0511 0.0187
0.0894 0.0368 0*0243 0.0279 0.0314 -0.0212 0.0582 -0.0403 //<====row 3, weird ASCII char * is present
-0.0548 0.0132 0.0299 0.0215 0.0236 0.0215 0.003 -0.0641
-0.0615 0.0421 0.0009 0.0457 0.0018 -0.0259 0.041 0.031
-0.0793 0.01 //<====row 6, the data is misaligned here
0.0278 0.0053 -0.0261 0.0016 0.0233 0.0719
0.0143 0.0163 -0.0101 -0.0114 -0.0338 -0.0415 0.0143 0.129
-0.0748 -0.0432 0.0044 0.0064 -0.0508 0.0042 0.0237 0.0295
0.040 -0.0232 -0.0299 -0.0066 -0.0539 -0.0485 -0.0106 0.0225
Every set of data consists of 2048 rows, and each row has 8 columns.
Here is the problem: when the logging software converts the data from binary files to text files, much of the data gets distorted. In the sample above, row 3, column 3 contains a " * ". And at row 6, one row of data is broken into two rows: one row has 2 values and the other has 6.
I am currently struggling to read this large text file using MATLAB. Because the file is so large, I can only use textscan to read the data.
For example:
C = textscan(fd,'%f%f%f%f%f%f%f%f',1,'Delimiter','\t ');
However, I cannot use '%f' as the format because the data contain several stray ASCII characters such as " * " or " ! ". These distorted values cannot be parsed as floating-point numbers. So I chose to use:
C = textscan(fd,'%s%s%s%s%s%s%s%s',1,'Delimiter','\t ');
and then convert those strings into doubles for processing. However, this runs into the problem of broken lines. When it reaches row 6, it gives:
[-0.0793],[0.01],[],[],[],[],[],[];
[0.0278],[0.0053],[-0.0261],[0.0016],[0.0233],[0.0719],[0.0143],[0.0163];
while it is supposed to look like:
-0.0793 0.01 0.0278 0.0053 -0.0261 0.0016 0.0233 0.0719 ===> one set
0.0143 0.0163 -0.0101 -0.0114 -0.0338 -0.0415 0.0143 0.129 ===> another set
From that point on, the data are offset by one row and the columns no longer line up.
Then I tried:
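For reference, here is a minimal sketch of the string-to-double step on a single intact row (row 3 from the sample). The variable names rowText and row are only illustrative, and the row is passed to textscan as text instead of a file identifier so the snippet stands alone.

```matlab
% Row 3 of the sample, including the distorted token '0*0243'.
rowText = '0.0894 0.0368 0*0243 0.0279 0.0314 -0.0212 0.0582 -0.0403';

% Read the eight fields as strings, exactly as in the call above.
C = textscan(rowText, '%s%s%s%s%s%s%s%s', 1, 'Delimiter', '\t ');

% Convert each field; str2double returns NaN for text it cannot parse.
row = cellfun(@(c) str2double(c{1}), C);

% Logical mask of the distorted entries in this row.
bad = isnan(row);
disp(find(bad));
```

This only flags distorted tokens within a row. It does nothing about a row that has been split across two lines.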
C = textscan(fd,'%s',1,'Delimiter','\t ');
to read one element at a time. If the element is NaN, it calls textscan again for the next one until it gets something other than NaN. Once it has obtained 2048 non-empty elements, it stores those 2048 values in a matrix to be processed. After processing, this matrix is cleared.
This method works well for the first 20% of the whole file... BUT,
since the file is 11 GB, which is very large, after reading about 20% of it MATLAB shows:
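A minimal sketch of this token-by-token loop, for reference. The temporary sample file, the names buf, n, and nWanted, and the small set size are illustrative assumptions and not my actual script; the textscan call is the same one shown above.

```matlab
% Write a tiny sample file so the sketch is self-contained (illustrative data).
fname = [tempname '.txt'];
fd = fopen(fname, 'w');
fprintf(fd, '0.0894 0.0368 0*0243 0.0279\n-0.0793 0.01\n0.0278 0.0053\n');
fclose(fd);

nWanted = 4;               % 2048 in the real file; kept small for illustration
buf = zeros(nWanted, 1);   % allocated once and reused for every set
n = 0;                     % number of valid values collected so far

fd = fopen(fname, 'r');
while ~feof(fd)
    % Read a single token as a string.
    C = textscan(fd, '%s', 1, 'Delimiter', '\t ');
    if isempty(C{1}), break; end      % nothing left to read

    v = str2double(C{1}{1});          % NaN for tokens such as '0*0243' or ''
    if isnan(v), continue; end        % skip distorted or empty tokens

    n = n + 1;
    buf(n) = v;
    if n == nWanted
        disp(buf.');                  % process the completed set here
        n = 0;                        % start the next set in the same buffer
    end
end
fclose(fd);
delete(fname);
```

Because distorted tokens are skipped and line breaks are ignored, a row split across two lines is stitched back together automatically.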
Error using ==> textscan
Out of memory. Type HELP MEMORY for your options.
(Some people suggest using %f with textscan, but that won't work because some of the ASCII characters cause problems.)
Any suggestions on how to deal with this file?
EDIT: I have tried:
C = textscan(fd,'%s%s%s%s%s%s%s%s',2048,'Delimiter','\t ');
Although the result is incorrect because of the misaligned data (like row 6), this code does not cause the "Out of memory" problem. The out-of-memory problem only occurs when I use
C = textscan(fd,'%s',1,'Delimiter','\t ');
to read the data one entry at a time. Does anyone have any idea why this memory problem happens?