0

I have a very large text file containing mostly numerical data with headers (strings). It has 13 columns and ~49,000 rows. All cells contain a numerical value (no empty cells, no columns with different # of rows). It is data about the solar wind taken from a satellite. It looks like this:

year,day,hr,min, sec,  Np,     Tp,          Vx_gsm,    Vy_gsm,  Vz_gsm,   Bx_gsm.....
YYYYxDDDxHHxMMxSSSSSSSxNNNNNNNxTTTTTTTTTTTTxVVVVVVVVVVxVVVVVVVVxVVVVVVVVVxB
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 .....

Now, the columns are actually perfectly aligned in the text file but I couldn't include all 13 columns here but that doesn't matter for now. Wherever there is a "-9999.90039", that means the value was corrupted for some reason. What I need to do is replace the "-9999.90039" values with NaN, so they won't be included in any calculations. I need to create arrays of the columns of cell data for Np and Tp (for now), initialize them to zero, and find the minimum and maximum values in each. This is a large file so I figured I needed to do this in blocks. Also, since I only need to access the values in the numerical cells of 1 column (per calculation) then I didn't need to change much and use textscan. This is what I've gotten so far:

 N = 10;% block size will have to be bigger but I wanted to test first
solarmax = fopen('ACE_magswe_64sec_2000.txt','r');
formatSpec = '%*d %*d %*d %*d %f %f %f %f %f %f %f %f %f';
% my prof said I need to initialize my variables by setting them to zero so
% I put this line below but I don't think it's right
k = 0;% not sure if this is necessary
while ~feof(solarmax)
    k = k+1;% not sure if this is necessary
    C = textscan(solarmax,formatSpec,N,'HeaderLines',2,'Delimiter','\t');
    function y = changeval(num)
    if (num('-9999.90039',num))
        y = 'NaN';
    end
end
fclose(solarmax);
Np = C{1,6};% not sure what to put here to call all values in that column
min(Np)
max(Np)
Tp = C{1,7};% same problem here
min(Tp)
max(Tp)

So, I put asterisks next to those columns I want ignored in FormatSpec and I used HeaderLines to ignore the first 2 rows. After, this I get confused on how I'm supposed to set this up (my only previous experience in programming was C++ in 2006!) Please help!!

UPDATED

Marcin offered me some great advice but I still have some issues that I need help with. Here is my code, now:

N = 1000;
solarmax = fopen('ACE_magswe_64sec_2000.txt','r+');
formatSpec = '%*d %*d %*d %*d %*f %f %f %f %f %f %f %f %f';

minNp = [];
maxNp = [];

minTp = [];
maxTp = [];

while ~feof(solarmax)
    C = textscan(solarmax,formatSpec,N,'HeaderLines',2,'Delimiter','\t');
    Np = cell2mat(C(:,1)); 
    Tp = cell2mat(C(:,2));

    Np(Np == -9999.90039) = NaN;
    Tp(Tp == -9999.90039) = NaN;

    minNp(end+1) = nanmin(Np);
    maxNp(end+1) = nanmax(Np); 

    minTp(end+1) = nanmin(Tp);
    maxTp(end+1) = nanmax(Tp);
end
fclose(solarmax);
nanmin(Np);
nanmax(Np);
nanmin(Tp);
nanmax(Tp);

When compiled and deleting the semicolons at the very end so I get values for the min and max functions, the values turn out to all be NaNs! I thought that the min and max commands already ignored the NaNs so I looked it up and nanmin/nanmax was suggested. However, this yielded the same results. Is there anything else I'm missing?

GreySky
  • 1
  • 3

1 Answers1

0

Maybe this will be helpful.

Example data based on your question. I made it longer so to have more than 10 rows.

year,day,hr,min, sec,  Np,     Tp,          Vx_gsm,    Vy_gsm,  Vz_gsm
YYYYxDDDxHHxMMxSSSSSSSxNNNNNNNxTTTTTTTTTTTTxVVVVVVVVVVxVVVVVVVVxVVVVVVVVVxB
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -17.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -67.01060 
2000   1  0  1 47.0496 3.42400 -9999.90039  -9999.90039 49.72259 -67.01060
2000   1  0  0 43.0272 3.27400 289450.00000 -674.22809 49.72259 -267.01060 
2000   1  0  1 47.0496 3.42400 -99.90039  -9999.90039 49.72259 -167.01060

Modified code based on the example data and your description:

N = 10;% block size will have to be bigger but I wanted to test first

solarmax = fopen('ACE_magswe_64sec_2000.txt','r');

formatSpec = '%*d %*d %*d %*d %f %f %f %f %f %f';

minNp = []; % to store min values for each iteration of the while loop
maxNp = []; % to store max values for each iteration of the while loop

while ~feof(solarmax)    

    % read N=10 rows from the file
    C = textscan(solarmax,formatSpec, N, 'HeaderLines',2,'Delimiter','\t');

    % get Np and Tp column as row vectors
    Np = cell2mat(C(:,2)); 
    Tp = cell2mat(C(:,3)); 

    % chage -9999.90039 to NaN
    Np(Np == -9999.90039) = NaN;
    Np(Np == -9999.90039) = NaN;

    % calculate min or max values for each set ot N=10 values as you
    % did. Probably need to store them, so do this:        
    minNp(end+1) = min(Np);
    maxNp(end+1) = max(Np);        

    % the same do for Tp.
end
fclose(solarmax);

Results in one min and max value for each iteration of loop:

minNp =

    3.2740    3.2740    3.2740


maxNp =

    3.4240    3.4240    3.4240
Marcin
  • 215,873
  • 14
  • 235
  • 294
  • This is great. It's straightforward and I like that. So, I tried it and I'm thinking there must be something wrong with my file because I got this error: Error using feof Invalid file identifier. Use fopen to generate a valid file identifier. This was also the error I got when I had what I thought was correct before. What do you think? – GreySky May 26 '14 at 03:09
  • Here is my code: N = 1000; solarmax = fopen('ACE_magswe_64sec_2000.txt','r'); formatSpec = '%*d %*d %*d %*d %*f %f %f %f %f %f %f %f %f'; minNp = []; maxNp = []; minTp = []; maxTp = []; while ~feof(solarmax) C = textscan(solarmax,formatSpec,N,'HeaderLines',2,'Delimiter','\t'); Np = cell2mat(C(:,1)); Tp = cell2mat(C(:,2)); minNp(end+1) = min(Np); maxNp(end+1) = max(Np); minTp(end+1) = min(Tp); maxTp(end+1) = max(Tp); end fclose(solarmax); – GreySky May 26 '14 at 03:12
  • @GreySky Glad to help. About your error have a look [here](http://stackoverflow.com/questions/10606373/what-causes-an-invalid-file-identifier-in-matlab) for possible causes. – Marcin May 26 '14 at 03:23
  • Okay, so I fixed that error. Silly me, I needed to add the file to the path. So, this seems to work except that the min and max values are all -9999.9! not sure how the max values are the same as the min. Could it have something to do with the negative sign? Also, I needed to find a way to replace those -9999.9 values with NaN. Should I use a separate while or if loop beforehand and export to a new file and then use that new file for calculations afterwards? – GreySky May 26 '14 at 04:13
  • @GreySky In the example provided the `-9999.90039` are replaced to `NaN`. Maybe they don't get replaced as expected in your case due to [how float numbers work](http://stackoverflow.com/questions/686439/why-is-24-0000-not-equal-to-24-0000-in-matlab). – Marcin May 26 '14 at 04:19
  • Okay, so I replaced my -9999.90039s with NaNs but then when I compute my min/max for Np and Tp, the answer is Nan! I looked up what to do and many people said to use nanmin and nanmax, even though min and max should already ignore the NaNs. I read about the float numbers and the problems with precision and that was very useful information... and then this happened. Any suggestions? – GreySky Jun 02 '14 at 00:30
  • Some data must be reading as inf although it's not. Attempting isfinite command. @Marcin Thank you for your time – GreySky Jun 05 '14 at 23:46