
I have multiple text files that are about 2GB in size (approximately 70 million lines). I also have a quad-core machine and access to the Parallel Computing toolbox.

Typically you might open a file and read lines like so:

f = fopen('file.txt');
l = fgets(f);
while ischar(l)    % fgets returns -1 (not '') at end-of-file
    % do something with l
    l = fgets(f);
end

I wanted to distribute the "do something with l" across my 4 cores, but that of course requires the use of a parfor loop. That would require that I "slurp" the 2GB file (to borrow a Perl term) into MATLAB a priori, instead of processing on the fly. I don't actually need l, just the result of the processing.

Is there a way to read lines out of a text file with parallel computing?

EDIT: It's worth mentioning that I can find the exact number of lines ahead of time (!wc -l mygiantfile.txt).

EDIT2: The structure of the file is as follows:

15 1180 62444 e0e0 049c f3ec 104

So 3 decimal numbers, 3 hex numbers, and 1 decimal number. Repeat this for 70 million lines.
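For records like that, a serial parse might use textscan, reading the hex columns as strings and converting them afterwards (a sketch; the file name and the use of hex2dec are assumptions, not from the question):

```matlab
% Sketch: parse the 7-column records described above.
% Assumes the file is named 'mygiantfile.txt' and is space-delimited.
fid = fopen('mygiantfile.txt');
C = textscan(fid, '%f %f %f %s %s %s %f');
fclose(fid);

dec1 = [C{1} C{2} C{3}];                               % three leading decimal columns
hx   = [hex2dec(C{4}), hex2dec(C{5}), hex2dec(C{6})];  % three hex columns
dec2 = C{7};                                           % trailing decimal column
```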

Dang Khoa
    perhaps you can [split](http://stackoverflow.com/q/2016894/97160) the file into 4 equal parts, and process them using `parfor`, one on each core... I'm thinking this is still IO bound, so you are not going to benefit much by spinning multiple process, unless the "do something with l" part is really CPU intensive. Perhaps even MATLAB is not the best tool for the job, lookup [tag:mapreduce] – Amro Sep 04 '13 at 17:23
  • If the file is a long binary vector, you can use `fscanf` \ `fread` with `parfor` by giving the proper pointer. I'll try to show an example later... – bla Sep 04 '13 at 19:56
  • @natan: here is an example with `fread` to read structured binary file in a vectorized manner: http://stackoverflow.com/a/8108683/97160 – Amro Sep 04 '13 at 21:46
  • very nice.... you got my +1 – bla Sep 05 '13 at 04:59
  • For those interested, there was a recent article on Loren Shure's blog that shows how to perform big data analysis using [distributed arrays](http://www.mathworks.com/help/distcomp/distributed.html) on a cluster of computers: http://blogs.mathworks.com/loren/2013/11/11/in-memory-big-data-analysis-with-pct-and-mdcs/ – Amro Nov 24 '13 at 01:25

2 Answers


Some of MATLAB's built-in functions support multithreading - the list is here. These require no Parallel Computing toolbox at all.

If the "do something with l" part can benefit from the toolbox, just apply that function to each line as it is read.

You may alternatively want to read the whole file using

fid = fopen('textfile.txt');
C  = textscan(fid,'%s','delimiter','\n');
fclose(fid);

and then process the cells of C{1} in parallel.
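With the whole file slurped into C, distributing the per-line work could then look like this (a sketch; processLine is a hypothetical stand-in for the "do something with l" step, and a scalar result per line is assumed):

```matlab
lines = C{1};                      % cell array of strings, one per line
results = zeros(numel(lines), 1);
parfor k = 1:numel(lines)
    % processLine is a placeholder for the actual per-line work
    results(k) = processLine(lines{k});
end
```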


If the reading time is a key issue, you may also want to access parts of the data file within a parfor loop. Here is an example from Edric M Ellis.

% Some data
x = rand(1000, 10);
fh = fopen( 'tmp.bin', 'wb' );
fwrite( fh, x, 'double' );
fclose( fh );

% Read the data
y = zeros(1000, 10);
parfor ii = 1:10
    fh = fopen( 'tmp.bin', 'rb' );
    % Get to the correct spot in the file:
    offset_bytes = (ii-1) * 1000 * 8; % 8 bytes/double
    fseek( fh, offset_bytes, 'bof' );
    % read a column
    y(:,ii) = fread( fh, 1000, 'double' );
    fclose( fh );
end

% Check
assert( isequal( x, y ) );
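If every line of the text file happens to have the same byte length (an assumption; space-delimited files usually don't, unless padded), the same fseek trick could split the line-oriented reads across workers. A sketch, with a hypothetical fixed record length:

```matlab
% Sketch: parallel reads from a fixed-width text file.
% Assumes 'mygiantfile.txt' has lines of exactly lineLen bytes
% (newline included) and that nLines divides evenly among workers.
nLines   = 70e6;                 % from: !wc -l mygiantfile.txt
nWorkers = 4;
linesPer = nLines / nWorkers;
lineLen  = 34;                   % hypothetical fixed record length
res = cell(nWorkers, 1);
parfor w = 1:nWorkers
    fh = fopen('mygiantfile.txt', 'r');
    % jump to the start of this worker's block of lines
    fseek(fh, (w-1) * linesPer * lineLen, 'bof');
    res{w} = textscan(fh, '%f %f %f %s %s %s %f', linesPer);
    fclose(fh);
end
```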
Amro
marsei
  • You are suggesting reading the whole file in memory? I don't see how that answers the question. File IO operations will never be multithreaded... And what happens when the file gets bigger than available RAM? – Amro Sep 04 '13 at 17:33
  • @Amro that was my exact thought - I want to avoid "slurping" the entire file ahead of time. – Dang Khoa Sep 04 '13 at 17:42
  • The RAM issue could easily be solved by using `textscan` with a loop: `for k=1:nTimes, C = textscan(fid,'%s',nLines,'delimiter','\n'); end` Since it will keep the current file position it won't slow the process. – Werner Sep 04 '13 at 17:42
  • @Werner: right, reading in chunks is definitely a viable solution. Another option is [`memmapfile`](http://www.mathworks.com/help/matlab/memory-mapping.html). My point is that all those are not easily parallelized, which is what the OP was asking about. – Amro Sep 04 '13 at 18:02
  • @Amro nice to hear about this `memmapfile`, didn't know about that solution x). Yeah, I noticed your concern with the parallelization but I can't answer that, hope Magla can enlighten that. I would say that it is pointless to parallelize since the reading speed will be limited to the disk read speed even if the file is separated at `n` different files. I think the best solution would have one thread reading while others are processing the already read information but it would be quite boring to synchronize. – Werner Sep 04 '13 at 18:22
  • @Amro you ought to write up an answer using `memmapfile`. This might be the best way to go. – Dang Khoa Sep 04 '13 at 18:28
  • @DangKhoa: it's not always applicable and depends on your file. Is it a structured file or a general ASCII text file? – Amro Sep 04 '13 at 18:44
  • @Amro It's ASCII, but it's always 7 space-delimited columns of either decimal (cols 1,2,7) or hex (3,4,5,6). – Dang Khoa Sep 04 '13 at 20:16
  • @DangKhoa: it would help if you post a small sample of the file showing the structure of its records. But if the file is space delimited rather than fixed-length fields, I doubt `memmapfile` would easily apply, at least without first converting the structure of the file to something conformant.. – Amro Sep 04 '13 at 21:44
  • Nice example! @Amro is quite right that you will only see benefit from PARFOR if what you do with the data takes much longer than the time required to read it from disk. – Edric Sep 05 '13 at 06:05
  • @Amro see my edit. I'm trying to optimize the data processing right now.. that may be a follow-on question. – Dang Khoa Sep 05 '13 at 15:01
  • @DangKhoa: I posted an example using `memmapfile`. I started it before you made the recent edit, so you'll have to adapt the code to your data :) – Amro Sep 05 '13 at 16:05
  • @Magla: I added a link to the newsgroup thread with Edric's example – Amro Sep 05 '13 at 16:39

As requested, I'm showing an example of memory-mapped files using the memmapfile class.

Since you didn't provide the exact format of the data file, I will create my own. The data I am creating is a table of N rows, each consisting of 4 columns:

  • first is a double scalar value
  • second is a single value
  • third is a fixed-length string representing a uint32 in HEX notation (e.g: D091BB44)
  • fourth column is a uint8 value

The code to generate the random data, and write it to binary file structured as described above:

% random data
N = 10;
data = [...
    num2cell(rand(N,1)), ...
    num2cell(rand(N,1,'single')), ...
    cellstr(dec2hex(randi(intmax('uint32'), [N,1]),8)), ...
    num2cell(randi([0 255], [N,1], 'uint8')) ...
];

% write to binary file
fid = fopen('file.bin', 'wb');
for i=1:N
    fwrite(fid, data{i,1}, 'double');
    fwrite(fid, data{i,2}, 'single');
    fwrite(fid, data{i,3}, 'char');
    fwrite(fid, data{i,4}, 'uint8');
end
fclose(fid);

Here is the resulting file viewed in a HEX editor:

binary file viewed in a hex editor

We can confirm the first record (note that my system uses little-endian byte ordering):

>> num2hex(data{1,1})
ans =
3fd4d780d56f2ca6

>> num2hex(data{1,2})
ans =
3ddd473e

>> arrayfun(@dec2hex, double(data{1,3}), 'UniformOutput',false)
ans = 
    '46'    '35'    '36'    '32'    '37'    '35'    '32'    '46'

>> dec2hex(data{1,4})
ans =
C0

Next we open the file using memory-mapping:

m = memmapfile('file.bin', 'Offset',0, 'Repeat',Inf, 'Writable',false, ...
    'Format',{
        'double', [1 1], 'd';
        'single', [1 1], 's';
        'uint8' , [1 8], 'h';      % since it doesn't directly support char
        'uint8' , [1 1], 'i'});

Now we can access the records as an ordinary structure array:

>> rec = m.Data;      % 10x1 struct array

>> rec(1)             % same as: data(1,:)
ans = 
    d: 0.3257
    s: 0.1080
    h: [70 53 54 50 55 53 50 70]
    i: 192

>> rec(4).d           % same as: data{4,1}
ans =
    0.5799

>> char(rec(10).h)    % same as: data{10,3}
ans =
2B2F493F

The benefit, for large data files, is that you can restrict the mapped "viewing window" to a small subset of the records, and move this view along the file:

% read the records two at-a-time
numRec = 10;                       % total number of records
lenRec = 8*1 + 4*1 + 1*8 + 1*1;    % length of each record in bytes
numRecPerView = 2;                 % how many records in a viewing window

m.Repeat = numRecPerView;
for i=1:(numRec/numRecPerView)
    % move the window along the file
    m.Offset = (i-1) * numRecPerView*lenRec;

    % read the two records in this window:
    %for j=1:numRecPerView, m.Data(j), end
    m.Data(1)
    m.Data(2)
end

access a portion of a file using memory-mapping
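Since each worker can open its own map, the chunked loop above could also be distributed with parfor. A sketch (assumes numRec divides evenly by numRecPerView, and that collecting the double field is the goal):

```matlab
numWin = numRec / numRecPerView;
out = cell(numWin, 1);
parfor i = 1:numWin
    % each worker maps just its own window of the file
    mm = memmapfile('file.bin', 'Writable',false, ...
        'Offset', (i-1) * numRecPerView*lenRec, ...
        'Repeat', numRecPerView, ...
        'Format',{
            'double', [1 1], 'd';
            'single', [1 1], 's';
            'uint8' , [1 8], 'h';
            'uint8' , [1 1], 'i'});
    out{i} = [mm.Data.d];        % e.g. collect the double field
end
```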

Amro
  • Awesome, thanks! I will have to adapt this for my needs but this is definitely a great start. – Dang Khoa Sep 05 '13 at 16:09
  • Just want to remind you that the above works for binary files with fixed-length fields and records (we know in advance how many bytes and the precision of each field). It wont work for space-delimited text files, not unless you use some kind of padding to get all fields aligned. – Amro Sep 05 '13 at 16:32
  • Right. I might go back to the source of this file and ask him to make it a binary format for me. – Dang Khoa Sep 05 '13 at 16:36