4

I have a large csv file (should be around 1 million lines) with option data with the following structure (content is altered):

secid, date, days, delta, impl_volatility, impl_strike, impl_premium, dispersion, cp_flag, ticker, index_flag, industry_group
100000, 02/05/1986, 60, -80, 0.270556, 74.2511, 5.2415, 0.021514, C, ASC, 0, 481
100000, 03/05/1986, 30, -40, 0.251556, 74.2571, 6.2415, 0.025524, P, ASC, 0, 481

I have successfully imported a test file using the following:

ftest = fopen('test.csv');
C = textscan(ftest,'%f %s %f %f %f %f %f %f %s %s %f %f','Headerlines',1,'Delimiter',',');
fclose(ftest);

However, C is a cell array and this makes it harder to handle the contents of the file in matlab. It would be easier to have it as a "regular" array (pardon me for not knowing the correct nomenclature, I just started working with matlab).

If I output C, I get:

Columns 1 through 6
[2x1 double]    {2x1 cell}    [2x1 double]    [2x1 double]    [2x1 double]    [2x1 double]
Columns 7 through 12
[2x1 double]    [2x1 double]    {2x1 cell}    {2x1 cell}    [2x1 double]    [2x1 double]

So inside the cell array which is C, there are arrays and cell arrays - arrays for numbers and cell arrays for strings. If I try to check element (1,2), I have to use C{1}(2) but if I want to check element (2,2) I have to use C{2}{2}. Ideally, I would like to access both as C(1,2) and C(2,2). The question is, how do I do this?

I have searched for solutions and found cells2mat but it only works if all content is numeric (I think). I found this solution: Convert cell array of cell arrays to matrix of matrices but horzcat retrieves an error, which I believe may occurr due to the same problem.

Thank you in advance for your time.

Community
  • 1
  • 1
sasstudent
  • 187
  • 1
  • 2
  • 11
  • 1
    A cell array is an array that can store values of different types. Here you have a cell array that stores both columns of numbers and strings. [This question](http://stackoverflow.com/questions/13540418/convert-cell-array-of-cell-arrays-to-matrix-of-matrices) assumes that all you have is numerical values, so a cell array is indeed redundant, and it _can_ be converted into a matrix, which is much easier to manipulate. But what do you expect to get when you have strings involved? – Eitan T Jul 15 '13 at 13:08

2 Answers2

5

As you have an array containing both numeric and character data, what you want is impossible (and trust me, it would be impractical, too).

Referencing individual numbers in a numeric array is different from referencing whole strings. There is simply no escaping that, nor should there be: you treat flowers differently from how you treat people (I sure hope so anyway).

In MATLAB, a string is an ordinary array, with the difference that each entry of the array does not represent a number, but a character. Referencing individual characters goes the same as referencing numbers in arrays:

>> a = 'my string'
>> a(4)
ans = 
s
>> a+0  % cast to double to show the "true character" of strings
ans =
    109   121    32   115   116   114   105   110   103

However, textscan assumes (rightfully so) that you don't want to do that, but rather you want to extract whole strings from the file. And whole strings should be referenced differently, to indicate that you mean whole strings and not individual characters.

I think you'll find it all a bit more intuitive if you split the results from textscan up into an ordinary numeric array and a cell array of strings, like so:

% load the data
ftest = fopen('test.csv');
C = textscan(ftest,...
    '%f %s %f %f %f %f %f %f %s %s %f %f',...     
    'collectoutput', true,...
    'Delimiter',',\n');
fclose(ftest);

% split into numeric and char arrays
numeric = [C{[1 3 5]}]
alpha   = [C{[2 4]}]

Referencing data in numeric then follows the same rules as any ordinary array, and referencing the strings in alpha then follows the normal cell-referencing rules (as in alpha{2,1} to get '03/05/1986')

EDIT Based no your comments, you want to do the conversion like this:

% Read the data
fid = fopen('test.csv', 'r');
C = textscan(fid,...
    '%f %s %f %f %f %f %f %f %s %s %f %f',...
    'Headerlines', 1,...
    'Delimiter',',');
fclose(fid);

% Delete 10th element ('ASC')
C(10) = [];

% Mass-convert dates to datenums
C{2} = datenum(C{2}, 'dd/mm/yyyy');

% Map 'P' to 1 and 'C' to 2
map('PC') = [1 2];
C{9} = map([C{9}{:}]).';

% Convert whole array to numeric array
C = [C{:}];
Rody Oldenhuis
  • 37,726
  • 7
  • 50
  • 96
  • Thank you for your answer @Rody. I understand the issue. I am considering transforming date and cp_flag in numbers and ignoring the ticker field. Can you advise me if this is an efficient way of doing so? `ftest = fopen('testfile2.csv'); C = textscan(ftest,'%f %s %f %f %f %f %f %f %s %s %f %f','Headerlines',1,'Delimiter',','); fclose(ftest); for i=1:size(C{9}) C{2}{i} = datenum(C{2}{i}, 'dd/mm/yyyy'); if C{9}{i} == 'P' C{9}{i} = 1; elseif C{9}{i} == 'C' C{9}{i} = 2; end end C{2} = cell2mat(C{2}); C{9} = cell2mat(C{9}); C = [C{[1 2 3 4 5 6 7 8 9 11 12]}]` – sasstudent Jul 15 '13 at 14:25
  • @japa: While your approach is not wrong, I edited in a version that is shorter, faster, and more MATLABy :) – Rody Oldenhuis Jul 15 '13 at 15:27
  • Thank you very much @Rody, it is very fast like this. It processes the 1052480 lines file in roughly 10 seconds. Thank you very much! – sasstudent Jul 15 '13 at 15:51
0

I was having the same issue...I'd prefer to have a 2-D cell array for easy accessibility and utilizing the built-in matlab sorting functions.

Here is another solution that might work for you (this is how TMW does it in their auto-generated code in the import tool). It turns the numerical arrays into cell arrays so that you can concatenate them into a 2-D matrix.

C([1,3,4,5,6,7,8,11,12]) = cellfun(@(x) num2cell(x), C([1,3,4,5,6,7,8,11,12]),'UniformOutput', false);

C = [C{1:end}];
andrewhunter
  • 183
  • 9