MATLAB: Missing data handling in categorical data

Question

I am trying to put my dataset into the MATLAB [ranked,weights] = relieff(X,Ylogical,10, 'categoricalx', 'on') function to rank the importance of my predictor features. The dataset<double n*m> has n observations and m discrete (i.e. categorical) features. It happens that each observation (row) in my dataset has at least one NaN value. These NaNs represent unobserved, i.e. missing or null, predictor values in the dataset. (There is no corruption in the dataset, it is just incomplete.)

relieff() uses this function below to remove any rows that contain a NaN:

function [X,Y] = removeNaNs(X,Y)
% Remove observations with missing data
NaNidx = bsxfun(@or,isnan(Y),any(isnan(X),2));
X(NaNidx,:) = [];
Y(NaNidx,:) = [];

This is not ideal, especially for my case, since it leaves me with X=[] and Y=[] (i.e. no observations!)

In this case:

1) Would replacing all NaNs with an arbitrary placeholder value, e.g. 99999, help? By doing this, I am introducing a new feature state for all the predictor features, so I guess it is not ideal.

2) or is replacing NaNs with the mode of the corresponding feature column vector (as below) statistically more sound? (I am not vectorising for clarity's sake)

function [matrixdata] = replaceNaNswithModes(matrixdata)

for i=1: size(matrixdata,2)
cv= matrixdata(:,i);
modevalue= mode(cv);
cv(find(isnan(cv))) = modevalue;
matrixdata(:,i) = cv;
end

3) Or any other sensible way that would make sense for "categorical" data?

P.S: This link gives possible ways to handle missing data.

My first question would be why are `NaN`s appearing in your data? Is this a corruption of the data set, or is this an explainable phenomenon? — macduff, Mar 06 '12 at 17:24
It is a manually entered dataset and NaNs are due to omission by the personnel who enter the data. There is no corruption in the dataset; it is, however, sparse. — Zhubarb, Mar 06 '12 at 18:15

score 1 · Answer 1 · answered Apr 22 '14 at 10:50

I suggest using a table instead of a matrix. Then you have functions such as ismissing (for the entire table) and isundefined (for categorical variables) to deal with missing values.

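T = array2table(matrix);             % variables are named matrix1, matrix2, ...
T = standardizeMissing(T, 99999);    % map a placeholder code (99999 is only an example) to NaN;
                                     % plain NaN is already the standard missing value for double
var1 = categorical(T.matrix1);       % NaN entries become <undefined>
missing = isundefined(var1);
T = T(~missing,:);                   % keep only rows where matrix1 is observed
matrix = table2array(T);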
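
Before dropping rows, it may help to see how the missing values are distributed. The following is a minimal sketch, assuming a small made-up matrix; the variable names fracMissingPerColumn and rowsWithAnyMissing are illustrative, not part of any MATLAB API.

```matlab
% Toy data: 4 observations, 3 categorical features coded as doubles.
matrix = [1 NaN 3; 2 1 NaN; NaN 1 3; 2 2 3];
T = array2table(matrix);

% ismissing returns a logical array the same size as the table.
miss = ismissing(T);

% Fraction of missing entries in each column.
fracMissingPerColumn = sum(miss, 1) / size(miss, 1);

% Rows that row-wise deletion (as done inside relieff) would discard.
rowsWithAnyMissing = any(miss, 2);
```

If rowsWithAnyMissing is true for nearly every row, as in the question, row-wise deletion is not viable and some form of imputation or recoding is needed instead.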

score 0 · Answer 2 · answered Jul 31 '13 at 19:22


You can take a look at this page: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. For the first dataset, a1a, it describes transforming categorical features into binary ones. That could possibly work. (:
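
One way to read this suggestion is as indicator (one-hot) coding, where a missing value simply produces all zeros across the indicator columns of that feature. The sketch below is an assumption about what such a transformation could look like, not the procedure used for the a1a dataset; the toy matrix and the names blocks and B are illustrative.

```matlab
% Toy data: 3 observations, 3 categorical features coded as doubles.
X = [1 NaN 3; 2 1 NaN; NaN 1 3];

blocks = cell(1, size(X, 2));
for j = 1:size(X, 2)
    col = X(:, j);
    levels = unique(col(~isnan(col)));      % observed categories only
    % One indicator column per observed category. NaN == level is always
    % false, so a missing entry yields a row of zeros in this block.
    blocks{j} = double(bsxfun(@eq, col, levels'));
end
B = [blocks{:}];                            % concatenate all blocks
```

The resulting B contains no NaN, so no rows would be discarded by NaN removal. Note that the indicator columns are binary features rather than the original multi-level ones, which changes what the ranking refers to.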

answered Jul 31 '13 at 19:22

Pedro.Alonso


score 0 · Answer 3 · 2012-03-19T14:29:35.627


For a start, neither solution (1) nor (2) helps you handle your data more properly, since NaN is in fact a label that MATLAB handles appropriately; warnings will be issued. What you should do is:

Handle the NaNs per case
Use try catch blocks

NaN is like a number, and there is nothing bad about it. Even if you divide by NaN, MATLAB will treat it properly and give you a NaN.

If you still want to replace them, then you will need an assumption that holds. For example, if your data is a time series of engine speeds entered by the engine operator, and some time instances have not been specified, then there is more than one way to handle the NaNs that will appear in the matrix:

Replace with 0s
Replace with the previous value
Replace with the next value
Replace with the average of the previous and the next value
and many more.

As you can see your problem is ill-posed, and depends on the predictor and the data source.

In case of categorical data, e.g. three categories {0,1,2} and supposing NaN occurs in Y.

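for k=1:size(Y,2)
  id = isnan(Y(:,k));          % logical mask of missing entries in column k
  m(k) = median(Y(~id,k));     % median of the observed values in column k
  Y(id,k) = round(m(k));       % round back to a valid category code
end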

I feel really bad that I had to write a for-loop, but I cannot see any other way. As you can see, I made a number of assumptions by using median and round. You may want to use a threshold depending on your knowledge of the data.
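
For what it is worth, the loop can be avoided. The sketch below swaps the median for the column-wise mode, which is an assumption on my part: for nominal codes the mode does not rely on any ordering of the categories, whereas the median does. It relies on mode ignoring NaN values; the toy matrix and the names colMode, M and idx are illustrative.

```matlab
% Toy data: categories {0,1,2} with some missing entries.
Y = [0 1; NaN 1; 0 NaN; 2 2];

colMode = mode(Y, 1);                   % 1-by-m row of column modes, NaNs ignored
idx = isnan(Y);                         % logical mask of missing entries
M = repmat(colMode, size(Y, 1), 1);     % expand the modes to the size of Y
Y(idx) = M(idx);                        % fill each gap with its column's mode
```

A column that is entirely NaN has no mode and stays NaN, so it would still need separate handling.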

edited Mar 19 '12 at 14:29

answered Mar 12 '12 at 01:14


Hi, Your suggestions (1-4) are great for real / continuous data. I have specified data being 'categorical' (i.e. nominal, not even ordinal) to emphasise that simple interpolation or smoothing does not cut the cheese in this case. Can you elaborate on your suggestion: 'Handle NaNs per case' ? – Zhubarb Mar 19 '12 at 13:56
@Berkan, Hi, it is not clear to me from the description how the NaN occurs in your case. Usually it comes from 0/0, inf/inf, or missing values in the input data. I suppose in your case it is 0/0 or inf/inf. The reason I am not suggesting something specific is that you don't give enough details on the predictor, though it is safe to assume that you know more about that. One possible policy is to take the median of each column (without the NaNs) and then replace each NaN with the median of its column. Another would be to take the mean, or put in random values; it depends on the predictor. – Mar 19 '12 at 14:16
Sorry, I should have clarified it in the very beginning (I added it to the body of the question now and I am changing the title of the question as well). In my case, NaNs represent missing (unobserved) values in the dataset. I have looked into the option of replacing them with Mean/Median/Mode-Imputation but have a feeling that it is not good. I have also read in a couple of places that it is bad practice. One reference suggests using Maximum Likelihoods or Multiple Imputations but I am still trying to get a grasp of it. – Zhubarb Mar 19 '12 at 15:00
@Berkan mean is MLE when the data is iid gaussian. Though if you have that knowledge, you don't need a predictor. I doubt, though that this is a programming matter. Thing is you must replace it, or handle, and that is up to you to decide. How you do it though is in the code above, you can just replace `median` appropriately. – Mar 19 '12 at 15:47

score 0 · Answer 4 · edited May 23 '17 at 12:27


I think the answer to this has been given by gd047 in dimension-reduction-in-categorical-data-with-missing-values:

I am going to look into this. If anyone has any other suggestions or particular MATLAB implementations, it would be great to hear them.

edited May 23 '17 at 12:27

Community


answered Mar 19 '12 at 15:07

Zhubarb

