Conditional text import or import by header name - MATLAB

Question

Is there a way to perform conditional text import within MATLAB? e.g. with a tab-delimited .txt file in this format:

Type    A   B   C   D   E
 A    5000  2   5   16  19
 A    5000  3   4   5   4
 A    5000  4   1   4   5
 B    500   19  8   2   7
 B    500   18  9   8   1
 B    500   2   9   13  2
 B    100   3   10  15  9
 B    5000  4   15  14  10

Is there a method to import only those lines where Column A contains '5000'?

This would be preferable to importing the entire .txt file and separating the data afterward because, in reality, my text files are rather large (~200MB each). If there is a way to do the latter quickly, though, that would also be a suitable solution.

Alternatively, is there a method (similar to R) where you can import and handle data using the headers contained in the .txt file? e.g. importing 'Type', 'A', 'B' and 'D' whilst ignoring 'C' and 'E' in the above example. This is needed if the input file is flexible in format, with additional columns sometimes added so that the relative positions of the columns change.

There is no built-in method that will do these things without you telling it more details about the file structure/format. Matlab routines are not nearly as automated as those in R. Both of the things you ask about can be done directly via `textscan`, but you'll need to make some assumptions about your file and/or provide additional knowledge about its makeup (e.g., length of header, ordering of columns). Otherwise you'll need to add the smarts by creating your own higher-level routine. — horchler, Aug 21 '15 at 16:18
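As a rough illustration of the comment above, and assuming the column order (Type, A, B, C, D, E) and the single header line are known in advance, a `textscan` format can skip unwanted fields with `%*`. The file name `in.txt` is an assumption borrowed from the sample layout in the question:

```matlab
% Assumes the column order Type, A, B, C, D, E is known in advance
fid = fopen('in.txt');
if fid == -1
    error('Could not open in.txt');
end
% '%*f' parses a numeric field but does not return it, so C and E are dropped
c = textscan(fid, '%s%f%f%*f%f%*f', 'HeaderLines', 1);
fclose(fid);
% c{1} holds Type, c{2} holds A, c{3} holds B and c{4} holds D
```

This only works while the column positions stay fixed; if columns can move, the header line has to be inspected first.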

il_raffa · Accepted Answer · 2015-08-21T16:49:32.860

You might try reading the input file line by line and checking whether each line contains the reference value (5000 in this case) in the reference column (column 2 in this case).

If it does, you store the line; otherwise, you discard it.

In the following code, based on your template, you can define the reference value and the reference column at the beginning of the code.

You can then convert the cell array output to a numeric array.

% Define the reference column index
col_idx = 2;
% Define the reference value
ref_value = 5000;
% Open the input file
fid = fopen('in.txt');
% Read (and discard) the header line
tline = fgetl(fid);
% Initialize the output variable (one cell row per stored line)
data = {};
% Read the file line by line
while true
   % Read the next line
   tline = fgetl(fid);
   % Check for the end of the file
   if ~ischar(tline)
      break
   end
   % Split the line into its fields; '%s' is used for the label so that
   % leading whitespace and multi-character labels are handled
   c = textscan(tline, '%s%f%f%f%f%f');
   % If the reference column holds the reference value, store the line
   if c{col_idx} == ref_value
      data = [data; c];
   end
end
fclose(fid);
% Convert the numeric columns from cell to a matrix
num_data = cell2mat(data(:, 2:end));
% Convert the first column to a char array
lab = char(vertcat(data{:, 1}));

Hope this helps.

Chris Taylor · Answer 2 · 2015-08-31T10:57:37.930

The function fgetl reads a single line from a text file, so one option would be to write a loop that repeatedly reads a line with fgetl and checks whether column A contains "5000" before deciding whether to include the line in your data set.

This is the solution presented in il_raffa's answer. Notice that you actually have to read the entire file anyway, since you read the entire line with fgetl and then use textscan on it! So it certainly won't be any faster than reading the entire file and then filtering it (though it may be more memory-efficient).

Really what you want is to read the file character by character, aborting each line if you can determine that you won't be reading it, based on the value of the "A" column.

If you were writing C or another low-level language this would probably be faster than importing the entire file and filtering it afterward. However, because of the overhead introduced by MATLAB it will almost certainly be faster and easier to read the entire file and filter it later. The textscan function is pretty good (and speedy) at reading delimited files, and 200MB is really not that large (it fits comfortably into memory on any modern computer, for example). You should just make sure to filter each data set after reading it, rather than reading all data sets and then filtering them all.
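A minimal sketch of this read-everything-then-filter approach, assuming the sample layout from the question (one header line, a text column followed by five numeric columns) and a file named `in.txt`; the variable names are illustrative only:

```matlab
% Assumed layout: one header line, then a text column and five numeric columns
fid = fopen('in.txt');
if fid == -1
    error('Could not open in.txt');
end
% Read all rows in one call, skipping the header line
c = textscan(fid, '%s%f%f%f%f%f', 'HeaderLines', 1);
fclose(fid);

% Logical mask of the rows whose column A (second field) equals 5000
keep = (c{2} == 5000);

% Keep only the matching rows
lab = c{1}(keep);            % cell array of type labels
num_data = [c{2:end}];       % all numeric columns side by side
num_data = num_data(keep, :);
```

With the sample data from the question, `keep` selects four rows (the three 'A' rows and the last 'B' row). When processing several files, the filtering step would run inside the loop over files so that only the reduced data is kept in memory.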

To the second part of your question, regarding whether you can selectively import columns - MATLAB doesn't provide a built-in way to do this. However, it isn't that tricky, if you can make a few assumptions about your file format. If we assume that

1. The file is in comma- or tab-delimited format
2. It has a header line

Then you can read the header line (using fgetl), which will tell you how many columns there are and what their names are. You can then use that information to build a call to textscan that reads the delimited columns, and filter out the ones whose headers don't match what you need. A simple version of this, for the tab-delimited case, might look like this:

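function columns = import_columns(filename, headers)
% IMPORT_COLUMNS  Read only the named columns of a tab-delimited text file.
%   headers is a cell array of column names. Every returned column is a
%   cell array of strings, in the order the columns appear in the file.

  fid = fopen(filename);
  if fid == -1
    error('Could not open file: %s', filename);
  end
  hdr = fgetl(fid);                              % read the header line
  column_headers = regexp(hdr, '\t', 'split');   % split on tabs
  column_headers = strtrim(column_headers);      % drop stray spaces

  num_cols = length(column_headers);
  format_str = repmat('%s', 1, num_cols); % create a string like '%s%s%s%s'
  columns = textscan(fid, format_str, 'Delimiter', '\t');
  fclose(fid);

  required_cols = ismember(column_headers, headers);
  columns(~required_cols) = []; % remove the columns you don't need

end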
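A possible way to call the function, sketched under the assumption that the question's sample is saved as a tab-delimited file named `in.txt`. Because the format string is all `%s`, each column comes back as a cell array of strings, so numeric columns need an explicit conversion. The variable names here are hypothetical:

```matlab
% Hypothetical call: keep the 'Type', 'A', 'B' and 'D' columns of in.txt
columns = import_columns('in.txt', {'Type', 'A', 'B', 'D'});

% Columns are returned in file order, each as a cell array of strings,
% so convert the numeric ones explicitly
type_col = columns{1};
a_col = str2double(columns{2});
b_col = str2double(columns{3});
d_col = str2double(columns{4});

% Filtering by value is then a logical index, e.g. rows where A is 5000
keep = (a_col == 5000);
b_selected = b_col(keep);
```

Note that the columns follow the order in the file, not the order of the names passed in, so the positions used above hold only for this sample layout.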
