I have implemented a data import in MATLAB to load very big *.DBF
files into my workspace. Now I'm trying to validate that the imported data is the same as the original data. My idea is to count the number of characters in the imported cell array and compare it to the number that Notepad++ reports under View -> Summary.
To import the files in MATLAB I used the following code:
fid = fopen(fullFileName,'r','n','UTF-8'); % used the UTF-8 option, because otherwise MATLAB wouldn't recognize German characters like ä,ö,ü
formatSpec=repmat('%s ',1,numberOfColumns); % numberOfColumns is 62 in my case
data = textscan(fid,formatSpec,'Delimiter','|');
fclose(fid);
data=horzcat(data{:});
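As a cross-check, the raw file can also be counted directly in MATLAB, independently of `textscan`. This is a sketch, assuming the file fits in memory; it reuses the same `fopen` encoding as the import above:

```matlab
% Count the characters of the raw file, before any parsing
fid = fopen(fullFileName,'r','n','UTF-8');  % same encoding as the import
raw = fread(fid,Inf,'*char')';              % whole file as one char row vector
fclose(fid);
rawCount = numel(raw);                      % includes delimiters and line breaks
```

Note that with Windows CRLF line endings each line break contributes two characters to `rawCount`, and a UTF-8 byte-order mark at the start of the file would show up as one extra character, so the number may not match Notepad++'s summary one-to-one.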
Now, to count the number of characters, I used the following code:
numberOfCharacters=sum(sum(cellfun(@length,data)))+size(data,1)+size(data,1)*(size(data,2)-1);
Here the first summand is the total number of characters in all cells. I had to add the second summand because Notepad++ counts the line breaks as characters. The third summand is the number of delimiters, which Notepad++ also counts.
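To illustrate the formula, here is a toy example with a made-up 2-by-2 cell array (not my real data), whose file representation would be `abc|de\nf|ghij\n`:

```matlab
% Toy check of the counting formula on a small cell array
data = {'abc','de'; 'f','ghij'};                  % 2 rows, 2 columns
charCount      = sum(sum(cellfun(@length,data))); % 10 characters in the cells
lineBreakCount = size(data,1);                    % 2 line breaks, one per row
delimCount     = size(data,1)*(size(data,2)-1);   % 2 '|' delimiters
total = charCount + lineBreakCount + delimCount;  % 14, the length of 'abc|de\nf|ghij\n'
```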
The results are
19,489,252 characters in Notepad++ and
19,485,889 characters in MATLAB.
As you can see, the difference is pretty small compared to the total number of characters. Still, I need to know what could be causing it.
One thing I already checked is the number of non-ASCII characters in Notepad++, using this answer. The non-ASCII characters are counted correctly.
Unfortunately I can't provide the data for you to test. So for an answer I would be happy about any suggestion as to what could cause the difference in character counts. Another method of proving that the data MATLAB imports is the same as the original data would be welcome, too.