3

I am really struggling to work out how I can import CSV data which contains 9 columns and around 400 rows of data into a table in the Matlab workspace. This would be easy if I was allowed to use the built-in toolboxes Matlab has to offer, but instead, I need to try and complete the task using only commands such as fscanf, fopen, etc. The data itself is in mixed formats for each column. Decimal, float, string, etc. I am also allowed to use CSVread, but I have not managed to get that to work since from my understanding CSVread only works with numeric values.

Here is my code:

>> filename = 'datatext.csv'; %The file in CSV format
>> fid = fopen(filename); 
>> headers = fgetl(fid); %using fgetl to remove the headers and put them in a variable

I have used fgetl to skip the headers line of the file and add them to their own variable, however I am not sure where to go from this point in creating the table. Basically I want to achieve a 400 row by 9 columns table in the workspace of Matlab.

Here is a sample of a few lines of what the text file looks like:

18,8,318,150,3436,11,70,1,sampletext
16,8,304,150,3433,12,70,1,sampletext2

I assume I will have to use built-in conversion functions of some of the cells, which I am allowed to do. I have probably missed some important information in order to receive the correct help, but any help anyone has will be much appreciated. Thankyou.

JamesW
  • 43
  • 4
  • 1
    [`readtable`](https://nl.mathworks.com/help/matlab/ref/readtable.html) does not require any toolbox, but if you have to stick to only low level io, you can indeed use `fscanf` or [`textscan`](https://nl.mathworks.com/help/matlab/ref/textscan.html) – rinkert Nov 23 '20 at 15:29
  • 1
    Thankyou, forgot to add that textscan is specifically not allowed, however, I will try and work with fscanf. – JamesW Nov 23 '20 at 15:41

3 Answers3

3

The lowest level functions to read a file are, (as described in Import Text Data Files with Low-Level I/O):


In your case, the input file is ascii, not binary, so right away we can drop the last option (fread).

You're left with fgetl/fgets (used to read a file line by line then parse each line) and fscanf.

You already have two answers using the line by line approach, so I won't detail this one and rather show you how you can use fscanf (since your data are suitable, they are indeed organised in a formated pattern).

The advantage of using fscanf, is that provided you use the right formatSpec parameter, the function will be able to read the whole file in one go, instead of iterating line by line. That is true for all the numeric data in your file. We will have to do a second pass for the text elements in your last column.

Defining the format specifiers:

first lets define your format specifications. We will use a different format for each pass. The first pass will read all the numeric data, but will skip the text fields, while the second pass will do the opposite, ignore all numeric data and read only the text fields. The '*' character is really useful for that when defining a format specifier:

DataFormatSpec = repmat('%d,',1,8) ;
DataFormatSpec = [DataFormatSpec '%*s'] ; % yield:  '%d,%d,%d,%d,%d,%d,%d,%d,%*s'

TextFormatSpec = repmat('%*d,',1,8) ;
TextFormatSpec = [TextFormatSpec '%s'] ; % yield:  '%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%s'

I've used %d for all columns because in your sample data I did not see the variety of numeric types you mentionned. You can easily substitute this for %f if the data requires it, and you can mix both types without problem in the same specifier (as long as they are all numeric). Just use what makes sense for each column.

Armed with that, let's get to the file. The data are in a formated pattern after the headerline, so we first need to get past the header line before calling fscanf. We'll do that as you've done:

%% Open file and retrieve header line
fid = fopen( filein,'r') ;
hdr = fgetl(fid) ;              % Retrieve the header line
DataStartIndex = ftell(fid) ;   % save the starting index of data section

The call to ftell allows us to save the position of the file pointer. After we read the header, the pointer is located right at the beginning of the data section. We save it to be able to rewind the file pointer to the same point for the second reading pass.

Reading the numeric data:

This is done extremely rapidly by fscanf in a simple call:

%% First pass, read the "numeric values"
dataArray = fscanf( fid , DataFormatSpec , [8 Inf]).' ;

Notice the transpose operator .' at the end of the line. This is because fscanf populate the values it reads in a column major order, but the data in the text file are read in a line major order. The transpose operation at the end is just to have an output array the same dimension than it was in the text file.

Now dataArray contains all your numeric data:

>> dataArray
dataArray =
          18           8         318         150        3436          11          70           1
          16           8         304         150        3433          12          70           1

Reading the text data:

This is were it gets slightly more complex. fscanf automatically convert text characters to their ascii value. It's easy enough to convert that back to actual characters (using the function char()). The biggest hurdle is that if we read all the texts fields in one go, they will all appear as a sucession of numbers, but there will be no way to know where each string stops and where the next starts. To overcome that, we'll do a reading line by line, but still using fscanf:

%% Second pass, read the "text" values
fseek(fid,DataStartIndex,'bof') ;   % Rewind file pointer to the start of data section
nRows = size(dataArray,1) ;         % How many lines we'll need to read
textArray = cell(nRows,1) ;         % pre allocate a cell array to receive the text column elements

for iline=1:nRows
    textArray{iline,1} = char( fscanf(fid,TextFormatSpec,1).' ) ;
end

fclose(fid) ;   % Close file

Notice again the use of the transpose operator .', and the use of char(). Now textArray is a cell array containing all your text fields:

>> textArray
textArray = 
    'sampletext'
    'sampletext2'

Regrouping data set:

Personnally, I would keep the two arrays separate as they are the most optimized containers for each type of data (double array for numeric data, and cell array for an array of strings). However, if you need to regroup them in a single data structure, you can either use a cell array:

%% Optional, merge data into cell array
FullArray = [num2cell(dataArray) textArray]
FullArray = 
    [18]    [8]    [318]    [150]    [3436]    [11]    [70]    [1]    'sampletext' 
    [16]    [8]    [304]    [150]    [3433]    [12]    [70]    [1]    'sampletext2'

Or you could use a table:

%% Optional, merge data into a table
T = array2table(dataArray) ;
T.text = textArray ;
T.Properties.VariableNames = [cellstr(reshape(sprintf('v%d',1:8),2,[]).') ; {'text'}] ;

Which gives:

T = 
    v1    v2    v3     v4      v5     v6    v7    v8        text     
    __    __    ___    ___    ____    __    __    __    _____________
    18    8     318    150    3436    11    70    1     'sampletext' 
    16    8     304    150    3433    12    70    1     'sampletext2'

Obviously if you choose the table version, use the variable names you parsed from the header instead of the automatically generated one I used in this example.

Hoki
  • 11,637
  • 1
  • 24
  • 43
2

Using fgetl() Repeatedly to Retrieve Text File Lines and Separating Line Contents Using split()

Not sure if using the split() function after reading in the text file contents is allowed but here is an implementation that uses fgetl() to grab the text file contents line by line using a loop. After retrieving all the lines in using split() with the second argument the delimiter set to a comma , allows the contents to be split into cells. The first for-loop retrieves the contents of the text file line by line and stores it in a string called Lines. The second for-loop splits the strings stored in Lines by the delimiter , allowing the cells to be stored in another string array shown below with the contents separated. Here -1 indicates a false entry retrieved/when the end of the file is reached.

Array Contents

Sample.txt

18,8,318,150,3436,11,70,1,sampletext
16,8,304,150,3433,12,70,1,sampletext2

Script:

Text = "Start";

% open file (read only)
fileID = fopen('Sample.txt', 'r');
%Running for loop till end of file termination "-1"%
Line_Index = 1;
while(Text ~= "-1")
    % read line/row
    Text = string(fgetl(fileID));
    % stopping criterion
    if (Text ~= "-1")
        Lines(Line_Index,1) = Text;
    end
    % update row index
    Line_Index = Line_Index + 1;
end
% close file
fclose(fileID);

[Number_Of_Lines,~] = size(Lines);
Output_Array = strings(Number_Of_Lines,9);


for Row_Index = 1: Number_Of_Lines
    Line = split(Lines(Row_Index,:),',');
    Line = Line';
    Output_Array(Row_Index,:) = string(Line);
end

Ran using MATLAB R2019b

max
  • 3,915
  • 2
  • 9
  • 25
MichaelTr7
  • 4,737
  • 2
  • 6
  • 21
  • 1
    Thank you very much for both answers. Both were extremely helpful and I have applied it to my problem. – JamesW Nov 24 '20 at 09:44
2

Though @MichaelTr7's answer is perfectly fine, I would like to suggest a a bit more elaborate answer including the conversion to types and eventually handing back a table. Note that it also includes pre-allocation of variables. Because MATLAB stores variables a consistent blocks in RAM, it is good to tell it beforehand how large your variable is going to be. (MATLAB actually complains about variables that appear to grow in a loop...)

The solution also builds on fgetl and (later) split + cell2table (this is definitely no low-level function anymore but this might be OK in your case as it does not handle the reading anymore)

% USER-INPUT
FileName = 'Sample.csv';
strType = "%l,%f,%d,%f,%f,%f,%f,%f,%s";
delimiter = ",";



% allocate strings
Data = strings(100,1);

% open file (read only)
fileID = fopen(FileName, 'r');
%Running for loop till end of file termination "-1"%
Line_idx = 1;
while true
    % read line/row
    Line_text = string(fgetl(fileID));
    % stopping criterion
    if (Line_text == "-1")
        break
    end
    
    Data(Line_idx,1) = Line_text;
    % update row index
    Line_idx = Line_idx + 1;
    
    % extend allocation
    if Line_idx > size(Data,1)
        Data = [Data;strings(100,1)]; %#ok<AGROW>
    end
end
% close file
fclose(fileID);
% crop variable/rows
Data = Data(1:Line_idx-1,:);



strType_splt = split(strType,    delimiter);
Num_strType = length( strType_splt );
Num_Data    = length( split(Data(1,:),  delimiter) );

% check number of conversion types
assert( Num_strType == Num_Data, strcat("Conversion format 'strType' has not the same number of values as the file ",FileName,"."))



% allocate cell
C = cell(size(Data,1),Num_strType);

% loop over rows & columns + convert the elements
for r = 1:size(Data,1) % loop over rows
    line = Data(r);
    % split into individual strings
    line_splt = split(line,  delimiter);
    
    for c = 1:Num_strType % loop over columns
        element_str = line_splt(c);
        type = strType_splt(c);
        C{r,c} = convertStr( type, element_str );
    end
end
% create table
T = cell2table(C);




function element = convertStr(type,str)

    switch type
        case "%f" % float = single | convert to double and cast to single
            element = single(str2double(str));
        case "%l" % long float
            element = str2double(str);
        case "%d" % convert to double and cast to integer
            element = int32(str2double(str));
        case "%s"
            element = string(str);
        case "%D" % non-standard: datetime
            element = datetime(str);
        otherwise
            element = {str};
    end
end

This assumes a file Sample.csv, e.g. with the following content:

18,8,318,150,3436,11,70,1,sampletext
16,8,304,150,3433,12,70,1,sampletext2
max
  • 3,915
  • 2
  • 9
  • 25