
I'm trying to analyze a large text log file (11 GB). All of the data are numerical values, and a snippet of the data is shown below.

-0.0623 0.0524  -0.0658 -0.0015 0.0136 -0.0063  0.0259  -0.003  
-0.0028 0.0403  0.0009  -0.0016 -0.0013 -0.0308 0.0511  0.0187  
0.0894  0.0368  0*0243  0.0279  0.0314  -0.0212 0.0582  -0.0403 //<====row 3, weird ASCII char * is present
-0.0548 0.0132  0.0299  0.0215  0.0236  0.0215  0.003   -0.0641 
-0.0615 0.0421  0.0009  0.0457  0.0018  -0.0259 0.041   0.031   
-0.0793 0.01  //<====row 6, the data is misaligned here
0.0278  0.0053  -0.0261 0.0016  0.0233  0.0719  
0.0143  0.0163  -0.0101 -0.0114 -0.0338 -0.0415 0.0143  0.129
-0.0748 -0.0432 0.0044      0.0064  -0.0508 0.0042  0.0237  0.0295      
0.040   -0.0232 -0.0299 -0.0066 -0.0539 -0.0485 -0.0106 0.0225  

Every set of data consists of 2048 rows, and each row has 8 columns.

Here comes the problem: when the data is converted from binary files to text files by the logging software, a lot of the values are distorted. Take the data above as an example: in row 3, column 3, a "*" appears inside a number. And in row 6, one row of data is broken into two rows, one with 2 values and the other with 6.

I am currently struggling to read this large text file in MATLAB. Since the file itself is so large, I can only use textscan to read the data.

For example:

C = textscan(fd,'%f%f%f%f%f%f%f%f',1,'Delimiter','\t');

However, I cannot use '%f' as the format, since the data contains several weird ASCII characters such as "*" or "!". These distorted entries cannot be parsed as floating-point numbers. So I chose to use:

C = textscan(fd,'%s%s%s%s%s%s%s%s',1,'Delimiter','\t');

and then I convert those strings to doubles for processing. However, this runs into the problem of broken lines. When it reaches row 6, it gives:

[-0.0793],[0.01],[],[],[],[],[],[];
[0.0278],[0.0053],[-0.0261],[0.0016],[0.0233],[0.0719],[0.0143],[0.0163];

while it is supposed to look like:

-0.0793 0.01 0.0278 0.0053  -0.0261 0.0016  0.0233  0.0719  ===> one set
0.0143  0.0163  -0.0101 -0.0114 -0.0338 -0.0415 0.0143  0.129 ===> another set

From then on, the data is offset by one row and the columns are messed up.
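For completeness, the string-based read and conversion is essentially this (a sketch of the idea, with fd being the open file handle):

C = textscan(fd,'%s%s%s%s%s%s%s%s',1,'Delimiter','\t'); % one row as 8 strings
row = str2double([C{:}]); % distorted entries such as '0*0243' become NaN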

Then I tried:

C = textscan(fd,'%s',1,'Delimiter','\t');

to read one element at a time. If the element converts to NaN, it scans the next one until it finds something other than NaN. Once it has collected 2048 non-empty elements, it stores them in a matrix for processing. After processing, the matrix is cleared.
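In code, the loop is roughly this (a sketch of the idea, not my exact code):

values = zeros(2048,1); % preallocate one block of good values
n = 0;
while n < 2048 && ~feof(fd)
    C = textscan(fd,'%s',1,'Delimiter','\t'); % one token as a string
    v = str2double(C{1}); % distorted tokens like '0*0243' give NaN
    if ~isempty(v) && ~isnan(v)
        n = n + 1;
        values(n) = v;
    end
end
% ... process values, clear it, and continue with the next block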

This method works well for the first 20% of the whole file....BUT,

since the file itself is 11 GB, which is very large, after reading about 20% of the file MATLAB shows:

Error using ==> textscan
Out of memory. Type HELP MEMORY for your options.

(Some people suggest using %f with textscan, but that won't work here because of the ASCII characters mixed into the data.)

Any suggestions to deal with this file?

EDIT: I have tried:

C = textscan(fd,'%s%s%s%s%s%s%s%s',2048,'Delimiter','\t');

Although the result is incorrect due to the misaligned data (like row 6), this code indeed does not cause the "Out of memory" problem. The out-of-memory problem only occurs when I use

C = textscan(fd,'%s',1,'Delimiter','\t');

to read the data one entry at a time. Does anyone have any idea why this memory problem happens?

Shawn Sun
  • Do you have access to the original binary files that you mentioned? If so do you know ... or can you find out their layout. You would probably have much better luck reading directly from binary if possible. – Aero Engy Jul 15 '15 at 00:31
  • Could you preprocess the file? Seems like you could write a couple algorithms to replace * with . and fix the extra new line characters. Then it would be much easier to process. – nalyd88 Jul 15 '15 at 01:34
  • For Aero Engy: Yes, I can access those binary files, but I don't think I can decode them, since the original data logging software's source code is confidential and the binary files are designed to be read only by that software. For nalyd88: there are numerous fancy chars among the data besides "*": "!", "?", "&*^*&%*^$^%%(&^*(*" ... anything can randomly pop up, so there's no way to preprocess it. – Shawn Sun Jul 15 '15 at 20:02

3 Answers


This might seem silly, but are you preallocating an array for this data? If the only issue (as it seems to be) with your last function is memory, perhaps

C = zeros(2048,8);

will alleviate your problem. Try inserting that line before you call textscan. I know that MATLAB often exhorts programmers to preallocate for speed; this is just a shot in the dark, but preallocating memory may fix your issue.

Edit: also see this MATLAB Central discussion of a similar issue. It may be that you will have to run the file in chunks, and then concatenate the arrays when each chunk is finished.
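A rough sketch of that chunked pattern (illustrative only; here each chunk is processed and then discarded rather than concatenated, so memory use stays flat; 'logfile.txt' is a placeholder name):

fd = fopen('logfile.txt','r');
while ~feof(fd)
    C = textscan(fd,'%s',2048*8,'Delimiter','\t'); % one block of tokens
    vals = str2double(C{1});
    vals = vals(~isnan(vals)); % drop distorted tokens
    % ... process this chunk, then let the next iteration overwrite it
end
fclose(fd);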

  • Thanks for your advice. I have tried splitting the file into chunks, but due to the misalignment in the raw data file (like row 6) I had a really hard time finding the correct split boundary for each block of 2048*8 entries. That's why I need to textscan '%s' one entry at a time, to gather exactly 2048 non-empty elements. I am now trying to preallocate "C" before each textscan, and I will see whether that fixes the problem. – Shawn Sun Jul 14 '15 at 21:07

Try something like the code below. It preallocates space and reads numRows*numColumns values from the text file at a time. If you can initialize the bigData matrix, then it shouldn't run out of memory ... I think.

Note: I used 9 for the number of rows since your sample data had 9 complete rows; you will want to use 2048, I presume. This might also need some end-of-file checks etc. and some error handling. Also, any numbers with odd ASCII text in them will turn into NaN.

Note 2: This still might not work, or might be very, very slow. I had a similar problem reading large text files (10-20 GB) that were slightly more complicated, and I had to abandon reading them in MATLAB. Instead I used Perl for an initial pass that output to binary, then used MATLAB to read the binary back into data. The 2-step approach ended up saving lots and lots of runtime. Link in case you are interested

function bigData = readData(fileName)
fid = fopen(fileName,'r');
numBlocks = 1;  %Somehow determine # of blocks??? not sure if you know of a way to determine this
r = 9; %Replace 9 with your size 2048
c = 8;

bigData = zeros(r*numBlocks,c);
for k = 1:numBlocks
    [dataBlock, rFlag] = readDataBlock(fid,r,c);
    if rFlag
        %Or some kind of error.
        break
    end
    bigData((k-1)*r+1:k*r,:) = dataBlock;
end
fclose(fid);

function [dataBlock, rFlag]= readDataBlock(fid,r,c)

C = textscan(fid,'%s',r*c,'Delimiter','\t'); %read one block of r*c tokens as strings
dataBlock = [];
if numel(C{1}) == r*c
    dataBlock = reshape(str2double(C{1}),c,r).'; %tokens arrive row by row: fill c-by-r, then transpose
    rFlag = false;
else
    rFlag = true;
   % ?? Throw an error or whatever is appropriate  
end
Aero Engy
  • I really appreciate your help. I did not use your code, but I borrowed your idea of reading a chunk of data first, then traversing it to pick the good 2048 values. Hmmm, still got some bugs, but I think the memory problem should be fixed. Thanks again. – Shawn Sun Jul 15 '15 at 20:03

While I don't really know how to solve your problems with the broken data, I can give some advice on how to process big text data: read it in batches of multiple lines, and write the output directly to the hard drive. In your case the second part might be unnecessary; once everything is working, you could replace data with an ordinary in-memory variable.

The code was originally written for a different purpose; I deleted the parser for my problem and replaced it with parsed_data=buffer; %TODO;

outputfile='out1.mat';
inputfile='logfile1';
batchsize=1000; %process 1000 lines at once
data=matfile(outputfile,'writable',true); %Simply delete this line if you want "data" to be an ordinary variable in memory
h=dir(inputfile);
bytes_to_read=h.bytes;
data.out={};
input=fopen(inputfile);
buffer={};

while ftell(input)<bytes_to_read
    chunk=textscan(input,'%s',batchsize-numel(buffer)); %read up to batchsize tokens
    buffer=[buffer;chunk{1}]; %append them to anything left in the buffer
    parsed_data=buffer; %TODO: replace with your parser
    data.out(end+1,1)={parsed_data};
    buffer={}; %In the default case, empty your buffer here.
    %If an incomplete line was only partially read, leave it in the buffer instead.
end

fclose(input);
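As a starting point for the %TODO parser, one possibility (illustrative only) is to convert the buffered strings and keep only the values that parse cleanly:

vals = str2double(buffer); % distorted tokens such as '0*0243' become NaN
parsed_data = vals(~isnan(vals));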
Daniel